Enhancement: Please add options for incremental training. (Code2Text) #23

Closed
Manas-Embold opened this issue Dec 3, 2020 · 17 comments

Comments

@Manas-Embold

Hi,
Please add an option for incremental training, so that it's possible to train on Colab or similar platforms.

@guoday
Contributor

guoday commented Dec 3, 2020

Do you mean gradient_accumulation_steps? The code has already implemented it. You can add the option --gradient_accumulation_steps n for incremental training.
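
(Background, for readers new to the flag: gradient accumulation splits one effective batch across several smaller forward/backward passes, so a large effective batch fits on a small GPU. A self-contained PyTorch sketch of the pattern, using a toy model and random data rather than anything from this repository:)

import torch
from torch import nn

# Toy setup so the accumulation pattern below actually runs; not CodeBERT.
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
dataloader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

accumulation_steps = 4  # effective batch = 4 x the per-step batch size
optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = nn.functional.mse_loss(model(x), y)  # forward pass on one small mini-batch
    (loss / accumulation_steps).backward()      # gradients accumulate in .grad buffers
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                        # one weight update per group of steps
        optimizer.zero_grad()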

@Manas-Embold
Author

Manas-Embold commented Dec 3, 2020

Alright, thanks much!
I come from a TensorFlow background, so I am unaware of how it's done in PyTorch.
I would be thankful if you could let me know exactly what I need to do.
Say I run training for 2 epochs, save a checkpoint, and want to start again from the saved checkpoint.

@guoday
Contributor

guoday commented Dec 3, 2020

Change "pretrained_model=microsoft/codebert-base" to "pretrained_model=saved_checkpoint_path"

@Manas-Embold
Author

Alright.
Thanks much!!

@Manas-Embold
Author

Manas-Embold commented Dec 3, 2020

One more question, just to be sure.
My calls should look like the following:

Do I need to use --gradient_accumulation_steps somewhere now, or is just --pretrained_model fine?

Call 1, for the first two epochs:
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

Call 2, for training the next two epochs:

python run.py --do_train --do_eval --model_type roberta --model_name_or_path "saved_checkpoint_path" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

@guoday
Contributor

guoday commented Dec 3, 2020

Just --pretrained_model is fine.

@Manas-Embold
Author

Thanks

@guoday
Contributor

guoday commented Dec 3, 2020

python run.py --do_train --do_eval --model_type roberta --model_name_or_path "saved_checkpoint_path" --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2

@guoday
Contributor

guoday commented Dec 3, 2020

Sorry, the option should be --load_model_path.

python run.py --do_train --do_eval --model_type roberta --model_name_or_path microsoft/codebert-base --train_filename "../dataset/java/valid.jsonl" --dev_filename "../dataset/java/valid.jsonl" --output_dir "model/java" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 8 --eval_batch_size 8 --learning_rate 5e-5 --num_train_epochs 2 --load_model_path $output_dir/checkpoint-best-bleu/pytorch_model.bin
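
(Note: $output_dir in the command above is the shell variable from the repository's run script; with the output_dir used earlier in this thread it would expand to something like:)

--load_model_path model/java/checkpoint-best-bleu/pytorch_model.bin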

@Manas-Embold
Author

Alright,
Thanks once again.

@Manas-Embold
Author

Manas-Embold commented Dec 3, 2020

Hi,
Just to test the flow, I started training for 1 epoch, and the model was saved:
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/javascript/valid.jsonl" --dev_filename "../dataset/javascript/valid.jsonl" --output_dir "model/javascript" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 16 --eval_batch_size 16 --learning_rate 5e-5 --num_train_epochs 1

Then I started training again from the trained model for the next 2 epochs:
python run.py --do_train --do_eval --model_type roberta --model_name_or_path "microsoft/codebert-base" --train_filename "../dataset/javascript/valid.jsonl" --dev_filename "../dataset/javascript/valid.jsonl" --output_dir "model/javascript" --max_source_length 256 --max_target_length 128 --beam_size 10 --train_batch_size 16 --eval_batch_size 16 --learning_rate 5e-5 --num_train_epochs 2 --load_model_path "/content/code/model/javascript/checkpoint-best-bleu/pytorch_model.bin"

Training has started again, but in the console it says "Epoch 0" again instead of "Epoch 1".
Is it normal for the script to say Epoch 0 again? Or is it actually Epoch 1, since essentially I am incrementally training from the last checkpoint model?

Log for first iteration (Epoch 1)
12/03/2020 08:34:08 - INFO - __main__ - Num examples = 3885
12/03/2020 08:34:08 - INFO - __main__ - Batch size = 16
12/03/2020 08:34:08 - INFO - __main__ - Num epoch = 1
epoch 0 loss 6.5622: 100% 243/243 [08:22<00:00, 2.07s/it]
12/03/2020 08:42:34 - INFO - __main__ -
***** Running evaluation *****
12/03/2020 08:42:34 - INFO - __main__ - Num examples = 3885
12/03/2020 08:42:34 - INFO - __main__ - Batch size = 16
12/03/2020 08:45:32 - INFO - __main__ - eval_ppl = 306.69674
12/03/2020 08:45:32 - INFO - __main__ - global_step = 244
12/03/2020 08:45:32 - INFO - __main__ - train_loss = 6.5622
12/03/2020 08:45:32 - INFO - __main__ - ********************
12/03/2020 08:45:34 - INFO - __main__ - Best ppl:306.69674
12/03/2020 08:45:34 - INFO - __main__ - ********************
Total: 1000
12/03/2020 08:53:21 - INFO - __main__ - bleu-4 = 7.58
12/03/2020 08:53:21 - INFO - __main__ - ********************
12/03/2020 08:53:21 - INFO - __main__ - Best bleu:7.58
12/03/2020 08:53:21 - INFO - __main__ - ********************


Log for second iteration (Epoch 2)

12/03/2020 08:58:29 - INFO - __main__ - ***** Running training *****
12/03/2020 08:58:29 - INFO - __main__ - Num examples = 3885
12/03/2020 08:58:29 - INFO - __main__ - Batch size = 16
12/03/2020 08:58:29 - INFO - __main__ - Num epoch = 2
epoch 0 loss 5.4316: 100% 243/243 [08:22<00:00, 2.07s/it]
12/03/2020 09:06:54 - INFO - __main__ -
***** Running evaluation *****
12/03/2020 09:06:54 - INFO - __main__ - Num examples = 3885
12/03/2020 09:06:54 - INFO - __main__ - Batch size = 16
12/03/2020 09:09:50 - INFO - __main__ - eval_ppl = 117.87884
12/03/2020 09:09:50 - INFO - __main__ - global_step = 244
12/03/2020 09:09:50 - INFO - __main__ - train_loss = 5.4316
12/03/2020 09:09:50 - INFO - __main__ - ********************
12/03/2020 09:09:52 - INFO - __main__ - Best ppl:117.87884
12/03/2020 09:09:52 - INFO - __main__ - ********************

@Manas-Embold
Author

Manas-Embold commented Dec 3, 2020

Since the loss has decreased in the subsequent epoch, shall I assume that it is actually Epoch 1 and not Epoch 0?
In simple terms, I want to be sure that it is not training from scratch again.

@Manas-Embold
Author

Note that I am training on valid.jsonl just to quickly test the flow.

@guoday
Contributor

guoday commented Dec 3, 2020

--load_model_path only re-loads the model from the checkpoint; the optimizer state and logs are reset. To implement proper incremental training, we may also need to save the optimizer state and logs.
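
(A minimal sketch of what saving that extra state could look like with standard PyTorch calls; the function name, file layout, and dictionary keys are illustrative, not what run.py currently writes:)

import torch

def save_training_state(model, optimizer, epoch, path):
    # Illustrative: bundle the weights, the optimizer state (Adam moments), and
    # the epoch counter into one file so a later run can resume where this one stopped.
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )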

@Manas-Embold
Author

Manas-Embold commented Dec 3, 2020

Alright.
Resetting the logger is fine,
but not the optimizer, right?

@guoday
Contributor

guoday commented Dec 3, 2020

Replace run.py with run.txt. You just need to re-run the following command and the program will restore the last checkpoint for incremental training.

lang=ruby #programming language
lr=5e-5
batch_size=32
beam_size=10
source_length=256
target_length=128
data_dir=../dataset
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
epochs=10 
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --num_train_epochs $epochs
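
(For reference, a sketch of the kind of auto-resume check such a modified run.py could perform at startup; the paths and dictionary keys are illustrative and may not match run.txt exactly:)

import os
import torch

def maybe_resume(model, optimizer, output_dir):
    # Illustrative auto-resume: if a previous training state exists under
    # output_dir, reload it and return the epoch to continue from; otherwise 0.
    last_ckpt = os.path.join(output_dir, "checkpoint-last", "training_state.bin")
    if not os.path.exists(last_ckpt):
        return 0
    state = torch.load(last_ckpt, map_location="cpu")
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"] + 1  # next epoch to run, so the counter keeps advancing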

@Manas-Embold
Author

Many thanks for the prompt response!
