Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unrecognized arguments: --warmup-ratio #88

Closed
AI-EnabledSoftwareEngineering-AISE opened this issue May 1, 2022 · 4 comments
Closed

Comments

@AI-EnabledSoftwareEngineering-AISE

Hi,
I am trying to train your model for the caption task, to do that I clone your last updated repository, and then I have followed your instruction. first of all, I faced a max_epoch error that was because of the shell version. After that I tried to train the model it gives me unrecognized arguments: --warmup-ratio=0.06. I go to your train code and I could not find the warmup-ratio variable, did you remove it? How should I solve this issue?

/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
                [--aim-repo AIM_REPO] [--aim-run-hash AIM_RUN_HASH]
                [--tensorboard-logdir TENSORBOARD_LOGDIR]
                [--wandb-project WANDB_PROJECT] [--azureml-logging]
                [--seed SEED] [--cpu] [--tpu] [--bf16]
                [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16]
                [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--on-cpu-convert-precision] [--min-loss-scale MIN_LOSS_SCALE]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                [--amp-batch-retries AMP_BATCH_RETRIES]
                [--amp-init-scale AMP_INIT_SCALE]
                [--amp-scale-window AMP_SCALE_WINDOW] [--user-dir USER_DIR]
                [--empty-cache-freq EMPTY_CACHE_FREQ]
                [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                [--model-parallel-size MODEL_PARALLEL_SIZE]
                [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                [--profile] [--reset-logging] [--suppress-crashes]
                [--use-plasma-view] [--plasma-path PLASMA_PATH]
                [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy,scst_reward_criterion,adjust_label_smoothed_cross_entropy,clip_scst_reward_criterion,adjust_label_smoothed_encouraging_loss}]
                [--tokenizer {moses,nltk,space}]
                [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                [--task TASK] [--num-workers NUM_WORKERS]
                [--skip-invalid-size-inputs-valid-test]
                [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                [--data-buffer-size DATA_BUFFER_SIZE]
                [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
                [--combine-valid-subsets] [--ignore-unused-valid-subsets]
                [--validate-interval VALIDATE_INTERVAL]
                [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                [--validate-after-updates VALIDATE_AFTER_UPDATES]
                [--fixed-validation-seed FIXED_VALIDATION_SEED]
                [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
                [--batch-size-valid BATCH_SIZE_VALID]
                [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM]
                [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS]
                [--shard-id SHARD_ID] [--grouped-shuffling]
                [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                [--update-ordered-indices-seed]
                [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn]
                [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                [--ddp-comm-hook {none,fp16}] [--bucket-cap-mb BUCKET_CAP_MB]
                [--fix-batches-to-gpus] [--find-unused-parameters]
                [--gradient-as-bucket-view] [--fast-stat-sync]
                [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers]
                [--slowmo-momentum SLOWMO_MOMENTUM]
                [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                [--localsgd-frequency LOCALSGD_FREQUENCY]
                [--nprocs-per-node NPROCS_PER_NODE]
                [--pipeline-model-parallel]
                [--pipeline-balance PIPELINE_BALANCE]
                [--pipeline-devices PIPELINE_DEVICES]
                [--pipeline-chunks PIPELINE_CHUNKS]
                [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                [--pipeline-checkpoint {always,never,except_last}]
                [--zero-sharding {none,os}] [--no-reshard-after-forward]
                [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
                [--not-fsdp-flatten-parameters] [--arch ARCH]
                [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
                [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR]
                [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                [--skip-remainder-batch] [--save-dir SAVE_DIR]
                [--restore-file RESTORE_FILE] [--continue-once CONTINUE_ONCE]
                [--finetune-from-model FINETUNE_FROM_MODEL]
                [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters]
                [--reset-optimizer]
                [--optimizer-overrides OPTIMIZER_OVERRIDES]
                [--save-interval SAVE_INTERVAL]
                [--save-interval-updates SAVE_INTERVAL_UPDATES]
                [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                [--keep-last-epochs KEEP_LAST_EPOCHS]
                [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save]
                [--no-epoch-checkpoints] [--no-last-checkpoints]
                [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                [--checkpoint-suffix CHECKPOINT_SUFFIX]
                [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                [--load-checkpoint-on-all-dp-ranks]
                [--write-checkpoints-asynchronously] [--store-ema]
                [--ema-decay EMA_DECAY] [--ema-start-update EMA_START_UPDATE]
                [--ema-seed-model EMA_SEED_MODEL]
                [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--dropout D] [--attention-dropout D] [--activation-dropout D]
                [--encoder-embed-path STR] [--encoder-embed-dim N]
                [--encoder-ffn-embed-dim N] [--encoder-layers N]
                [--encoder-attention-heads N] [--encoder-normalize-before]
                [--encoder-learned-pos] [--decoder-embed-path STR]
                [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
                [--decoder-layers N] [--decoder-attention-heads N]
                [--decoder-learned-pos] [--decoder-normalize-before]
                [--decoder-output-dim N] [--share-decoder-input-output-embed]
                [--share-all-embeddings] [--no-token-positional-embeddings]
                [--adaptive-softmax-cutoff EXPR]
                [--adaptive-softmax-dropout D] [--layernorm-embedding]
                [--no-scale-embedding] [--checkpoint-activations]
                [--offload-activations] [--no-cross-attention]
                [--cross-self-attention] [--encoder-layerdrop D]
                [--decoder-layerdrop D]
                [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
                [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
                [--quant-noise-pq D] [--quant-noise-pq-block-size D]
                [--quant-noise-scalar D] [--min-params-to-wrap D]
                [--resnet-drop-path-rate RESNET_DROP_PATH_RATE]
                [--encoder-drop-path-rate ENCODER_DROP_PATH_RATE]
                [--decoder-drop-path-rate DECODER_DROP_PATH_RATE]
                [--token-bucket-size TOKEN_BUCKET_SIZE]
                [--image-bucket-size IMAGE_BUCKET_SIZE]
                [--attn-scale-factor ATTN_SCALE_FACTOR] [--freeze-resnet]
                [--freeze-encoder-embedding] [--freeze-decoder-embedding]
                [--add-type-embedding]
                [--resnet-type {resnet50,resnet101,resnet152}]
                [--resnet-model-path STR] [--code-image-size CODE_IMAGE_SIZE]
                [--patch-layernorm-embedding] [--code-layernorm-embedding]
                [--entangle-position-embedding] [--disable-entangle]
                [--sync-bn] [--scale-attn] [--scale-fc] [--scale-heads]
                [--scale-resids] [--pooler-dropout D]
                [--pooler-classifier {mlp,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--spectral-norm-classification-head]
                [--selected-cols SELECTED_COLS] [--bpe-dir BPE_DIR]
                [--max-source-positions MAX_SOURCE_POSITIONS]
                [--max-target-positions MAX_TARGET_POSITIONS]
                [--max-src-length MAX_SRC_LENGTH]
                [--max-tgt-length MAX_TGT_LENGTH]
                [--code-dict-size CODE_DICT_SIZE]
                [--patch-image-size PATCH_IMAGE_SIZE] [--num-bins NUM_BINS]
                [--imagenet-default-mean-and-std]
                [--constraint-range CONSTRAINT_RANGE] [--eval-bleu]
                [--eval-cider] [--eval-args EVAL_ARGS] [--eval-print-samples]
                [--eval-cider-cached-tokens EVAL_CIDER_CACHED_TOKENS] [--scst]
                [--scst-args SCST_ARGS] [--label-smoothing LABEL_SMOOTHING]
                [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE]
                [--ignore-eos] [--drop-worst-ratio DROP_WORST_RATIO]
                [--drop-worst-after DROP_WORST_AFTER] [--use-rdrop]
                [--reg-alpha REG_ALPHA] [--sample-patch-num SAMPLE_PATCH_NUM]
                [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
                [--weight-decay WEIGHT_DECAY] [--use-old-adam]
                [--fp16-adam-stats] [--warmup-updates WARMUP_UPDATES]
                [--force-anneal FORCE_ANNEAL]
                [--end-learning-rate END_LEARNING_RATE] [--power POWER]
                [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
                [--unk UNK]
                data
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
                [--aim-repo AIM_REPO] [--aim-run-hash AIM_RUN_HASH]
                [--tensorboard-logdir TENSORBOARD_LOGDIR]
                [--wandb-project WANDB_PROJECT] [--azureml-logging]
                [--seed SEED] [--cpu] [--tpu] [--bf16]
                [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16]
                [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--on-cpu-convert-precision] [--min-loss-scale MIN_LOSS_SCALE]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                [--amp-batch-retries AMP_BATCH_RETRIES]
                [--amp-init-scale AMP_INIT_SCALE]
                [--amp-scale-window AMP_SCALE_WINDOW] [--user-dir USER_DIR]
                [--empty-cache-freq EMPTY_CACHE_FREQ]
                [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                [--model-parallel-size MODEL_PARALLEL_SIZE]
                [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                [--profile] [--reset-logging] [--suppress-crashes]
                [--use-plasma-view] [--plasma-path PLASMA_PATH]
                [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy,scst_reward_criterion,adjust_label_smoothed_cross_entropy,clip_scst_reward_criterion,adjust_label_smoothed_encouraging_loss}]
                [--tokenizer {moses,nltk,space}]
                [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                [--task TASK] [--num-workers NUM_WORKERS]
                [--skip-invalid-size-inputs-valid-test]
                [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                [--data-buffer-size DATA_BUFFER_SIZE]
                [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
                [--combine-valid-subsets] [--ignore-unused-valid-subsets]
                [--validate-interval VALIDATE_INTERVAL]
                [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                [--validate-after-updates VALIDATE_AFTER_UPDATES]
                [--fixed-validation-seed FIXED_VALIDATION_SEED]
                [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
                [--batch-size-valid BATCH_SIZE_VALID]
                [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM]
                [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS]
                [--shard-id SHARD_ID] [--grouped-shuffling]
                [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                [--update-ordered-indices-seed]
                [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn]
                [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                [--ddp-comm-hook {none,fp16}] [--bucket-cap-mb BUCKET_CAP_MB]
                [--fix-batches-to-gpus] [--find-unused-parameters]
                [--gradient-as-bucket-view] [--fast-stat-sync]
                [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers]
                [--slowmo-momentum SLOWMO_MOMENTUM]
                [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                [--localsgd-frequency LOCALSGD_FREQUENCY]
                [--nprocs-per-node NPROCS_PER_NODE]
                [--pipeline-model-parallel]
                [--pipeline-balance PIPELINE_BALANCE]
                [--pipeline-devices PIPELINE_DEVICES]
                [--pipeline-chunks PIPELINE_CHUNKS]
                [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                [--pipeline-checkpoint {always,never,except_last}]
                [--zero-sharding {none,os}] [--no-reshard-after-forward]
                [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
                [--not-fsdp-flatten-parameters] [--arch ARCH]
                [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
                [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR]
                [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                [--skip-remainder-batch] [--save-dir SAVE_DIR]
                [--restore-file RESTORE_FILE] [--continue-once CONTINUE_ONCE]
                [--finetune-from-model FINETUNE_FROM_MODEL]
                [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters]
                [--reset-optimizer]
                [--optimizer-overrides OPTIMIZER_OVERRIDES]
                [--save-interval SAVE_INTERVAL]
                [--save-interval-updates SAVE_INTERVAL_UPDATES]
                [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                [--keep-last-epochs KEEP_LAST_EPOCHS]
                [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save]
                [--no-epoch-checkpoints] [--no-last-checkpoints]
                [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                [--checkpoint-suffix CHECKPOINT_SUFFIX]
                [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                [--load-checkpoint-on-all-dp-ranks]
                [--write-checkpoints-asynchronously] [--store-ema]
                [--ema-decay EMA_DECAY] [--ema-start-update EMA_START_UPDATE]
                [--ema-seed-model EMA_SEED_MODEL]
                [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--dropout D] [--attention-dropout D] [--activation-dropout D]
                [--encoder-embed-path STR] [--encoder-embed-dim N]
                [--encoder-ffn-embed-dim N] [--encoder-layers N]
                [--encoder-attention-heads N] [--encoder-normalize-before]
                [--encoder-learned-pos] [--decoder-embed-path STR]
                [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
                [--decoder-layers N] [--decoder-attention-heads N]
                [--decoder-learned-pos] [--decoder-normalize-before]
                [--decoder-output-dim N] [--share-decoder-input-output-embed]
                [--share-all-embeddings] [--no-token-positional-embeddings]
                [--adaptive-softmax-cutoff EXPR]
                [--adaptive-softmax-dropout D] [--layernorm-embedding]
                [--no-scale-embedding] [--checkpoint-activations]
                [--offload-activations] [--no-cross-attention]
                [--cross-self-attention] [--encoder-layerdrop D]
                [--decoder-layerdrop D]
                [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
                [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
                [--quant-noise-pq D] [--quant-noise-pq-block-size D]
                [--quant-noise-scalar D] [--min-params-to-wrap D]
                [--resnet-drop-path-rate RESNET_DROP_PATH_RATE]
                [--encoder-drop-path-rate ENCODER_DROP_PATH_RATE]
                [--decoder-drop-path-rate DECODER_DROP_PATH_RATE]
                [--token-bucket-size TOKEN_BUCKET_SIZE]
                [--image-bucket-size IMAGE_BUCKET_SIZE]
                [--attn-scale-factor ATTN_SCALE_FACTOR] [--freeze-resnet]
                [--freeze-encoder-embedding] [--freeze-decoder-embedding]
                [--add-type-embedding]
                [--resnet-type {resnet50,resnet101,resnet152}]
                [--resnet-model-path STR] [--code-image-size CODE_IMAGE_SIZE]
                [--patch-layernorm-embedding] [--code-layernorm-embedding]
                [--entangle-position-embedding] [--disable-entangle]
                [--sync-bn] [--scale-attn] [--scale-fc] [--scale-heads]
                [--scale-resids] [--pooler-dropout D]
                [--pooler-classifier {mlp,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--spectral-norm-classification-head]
                [--selected-cols SELECTED_COLS] [--bpe-dir BPE_DIR]
                [--max-source-positions MAX_SOURCE_POSITIONS]
                [--max-target-positions MAX_TARGET_POSITIONS]
                [--max-src-length MAX_SRC_LENGTH]
                [--max-tgt-length MAX_TGT_LENGTH]
                [--code-dict-size CODE_DICT_SIZE]
                [--patch-image-size PATCH_IMAGE_SIZE] [--num-bins NUM_BINS]
                [--imagenet-default-mean-and-std]
                [--constraint-range CONSTRAINT_RANGE] [--eval-bleu]
                [--eval-cider] [--eval-args EVAL_ARGS] [--eval-print-samples]
                [--eval-cider-cached-tokens EVAL_CIDER_CACHED_TOKENS] [--scst]
                [--scst-args SCST_ARGS] [--label-smoothing LABEL_SMOOTHING]
                [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE]
                [--ignore-eos] [--drop-worst-ratio DROP_WORST_RATIO]
                [--drop-worst-after DROP_WORST_AFTER] [--use-rdrop]
                [--reg-alpha REG_ALPHA] [--sample-patch-num SAMPLE_PATCH_NUM]
                [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
                [--weight-decay WEIGHT_DECAY] [--use-old-adam]
                [--fp16-adam-stats] [--warmup-updates WARMUP_UPDATES]
                [--force-anneal FORCE_ANNEAL]
                [--end-learning-rate END_LEARNING_RATE] [--power POWER]
                [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
                [--unk UNK]
                data
train.py: error: unrecognized arguments: --warmup-ratio=0.06
train.py: error: unrecognized arguments: --warmup-ratio=0.06
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
                [--aim-repo AIM_REPO] [--aim-run-hash AIM_RUN_HASH]
                [--tensorboard-logdir TENSORBOARD_LOGDIR]
                [--wandb-project WANDB_PROJECT] [--azureml-logging]
                [--seed SEED] [--cpu] [--tpu] [--bf16]
                [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16]
                [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--on-cpu-convert-precision] [--min-loss-scale MIN_LOSS_SCALE]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                [--amp-batch-retries AMP_BATCH_RETRIES]
                [--amp-init-scale AMP_INIT_SCALE]
                [--amp-scale-window AMP_SCALE_WINDOW] [--user-dir USER_DIR]
                [--empty-cache-freq EMPTY_CACHE_FREQ]
                [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                [--model-parallel-size MODEL_PARALLEL_SIZE]
                [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                [--profile] [--reset-logging] [--suppress-crashes]
                [--use-plasma-view] [--plasma-path PLASMA_PATH]
                [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy,scst_reward_criterion,adjust_label_smoothed_cross_entropy,clip_scst_reward_criterion,adjust_label_smoothed_encouraging_loss}]
                [--tokenizer {moses,nltk,space}]
                [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                [--task TASK] [--num-workers NUM_WORKERS]
                [--skip-invalid-size-inputs-valid-test]
                [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                [--data-buffer-size DATA_BUFFER_SIZE]
                [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
                [--combine-valid-subsets] [--ignore-unused-valid-subsets]
                [--validate-interval VALIDATE_INTERVAL]
                [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                [--validate-after-updates VALIDATE_AFTER_UPDATES]
                [--fixed-validation-seed FIXED_VALIDATION_SEED]
                [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
                [--batch-size-valid BATCH_SIZE_VALID]
                [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM]
                [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS]
                [--shard-id SHARD_ID] [--grouped-shuffling]
                [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                [--update-ordered-indices-seed]
                [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn]
                [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                [--ddp-comm-hook {none,fp16}] [--bucket-cap-mb BUCKET_CAP_MB]
                [--fix-batches-to-gpus] [--find-unused-parameters]
                [--gradient-as-bucket-view] [--fast-stat-sync]
                [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers]
                [--slowmo-momentum SLOWMO_MOMENTUM]
                [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                [--localsgd-frequency LOCALSGD_FREQUENCY]
                [--nprocs-per-node NPROCS_PER_NODE]
                [--pipeline-model-parallel]
                [--pipeline-balance PIPELINE_BALANCE]
                [--pipeline-devices PIPELINE_DEVICES]
                [--pipeline-chunks PIPELINE_CHUNKS]
                [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                [--pipeline-checkpoint {always,never,except_last}]
                [--zero-sharding {none,os}] [--no-reshard-after-forward]
                [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
                [--not-fsdp-flatten-parameters] [--arch ARCH]
                [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
                [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR]
                [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                [--skip-remainder-batch] [--save-dir SAVE_DIR]
                [--restore-file RESTORE_FILE] [--continue-once CONTINUE_ONCE]
                [--finetune-from-model FINETUNE_FROM_MODEL]
                [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters]
                [--reset-optimizer]
                [--optimizer-overrides OPTIMIZER_OVERRIDES]
                [--save-interval SAVE_INTERVAL]
                [--save-interval-updates SAVE_INTERVAL_UPDATES]
                [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                [--keep-last-epochs KEEP_LAST_EPOCHS]
                [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save]
                [--no-epoch-checkpoints] [--no-last-checkpoints]
                [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                [--checkpoint-suffix CHECKPOINT_SUFFIX]
                [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                [--load-checkpoint-on-all-dp-ranks]
                [--write-checkpoints-asynchronously] [--store-ema]
                [--ema-decay EMA_DECAY] [--ema-start-update EMA_START_UPDATE]
                [--ema-seed-model EMA_SEED_MODEL]
                [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--dropout D] [--attention-dropout D] [--activation-dropout D]
                [--encoder-embed-path STR] [--encoder-embed-dim N]
                [--encoder-ffn-embed-dim N] [--encoder-layers N]
                [--encoder-attention-heads N] [--encoder-normalize-before]
                [--encoder-learned-pos] [--decoder-embed-path STR]
                [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
                [--decoder-layers N] [--decoder-attention-heads N]
                [--decoder-learned-pos] [--decoder-normalize-before]
                [--decoder-output-dim N] [--share-decoder-input-output-embed]
                [--share-all-embeddings] [--no-token-positional-embeddings]
                [--adaptive-softmax-cutoff EXPR]
                [--adaptive-softmax-dropout D] [--layernorm-embedding]
                [--no-scale-embedding] [--checkpoint-activations]
                [--offload-activations] [--no-cross-attention]
                [--cross-self-attention] [--encoder-layerdrop D]
                [--decoder-layerdrop D]
                [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
                [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
                [--quant-noise-pq D] [--quant-noise-pq-block-size D]
                [--quant-noise-scalar D] [--min-params-to-wrap D]
                [--resnet-drop-path-rate RESNET_DROP_PATH_RATE]
                [--encoder-drop-path-rate ENCODER_DROP_PATH_RATE]
                [--decoder-drop-path-rate DECODER_DROP_PATH_RATE]
                [--token-bucket-size TOKEN_BUCKET_SIZE]
                [--image-bucket-size IMAGE_BUCKET_SIZE]
                [--attn-scale-factor ATTN_SCALE_FACTOR] [--freeze-resnet]
                [--freeze-encoder-embedding] [--freeze-decoder-embedding]
                [--add-type-embedding]
                [--resnet-type {resnet50,resnet101,resnet152}]
                [--resnet-model-path STR] [--code-image-size CODE_IMAGE_SIZE]
                [--patch-layernorm-embedding] [--code-layernorm-embedding]
                [--entangle-position-embedding] [--disable-entangle]
                [--sync-bn] [--scale-attn] [--scale-fc] [--scale-heads]
                [--scale-resids] [--pooler-dropout D]
                [--pooler-classifier {mlp,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--spectral-norm-classification-head]
                [--selected-cols SELECTED_COLS] [--bpe-dir BPE_DIR]
                [--max-source-positions MAX_SOURCE_POSITIONS]
                [--max-target-positions MAX_TARGET_POSITIONS]
                [--max-src-length MAX_SRC_LENGTH]
                [--max-tgt-length MAX_TGT_LENGTH]
                [--code-dict-size CODE_DICT_SIZE]
                [--patch-image-size PATCH_IMAGE_SIZE] [--num-bins NUM_BINS]
                [--imagenet-default-mean-and-std]
                [--constraint-range CONSTRAINT_RANGE] [--eval-bleu]
                [--eval-cider] [--eval-args EVAL_ARGS] [--eval-print-samples]
                [--eval-cider-cached-tokens EVAL_CIDER_CACHED_TOKENS] [--scst]
                [--scst-args SCST_ARGS] [--label-smoothing LABEL_SMOOTHING]
                [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE]
                [--ignore-eos] [--drop-worst-ratio DROP_WORST_RATIO]
                [--drop-worst-after DROP_WORST_AFTER] [--use-rdrop]
                [--reg-alpha REG_ALPHA] [--sample-patch-num SAMPLE_PATCH_NUM]
                [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
                [--weight-decay WEIGHT_DECAY] [--use-old-adam]
                [--fp16-adam-stats] [--warmup-updates WARMUP_UPDATES]
                [--force-anneal FORCE_ANNEAL]
                [--end-learning-rate END_LEARNING_RATE] [--power POWER]
                [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
                [--unk UNK]
                data
train.py: error: unrecognized arguments: --warmup-ratio=0.06
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
                [--aim-repo AIM_REPO] [--aim-run-hash AIM_RUN_HASH]
                [--tensorboard-logdir TENSORBOARD_LOGDIR]
                [--wandb-project WANDB_PROJECT] [--azureml-logging]
                [--seed SEED] [--cpu] [--tpu] [--bf16]
                [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16]
                [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--on-cpu-convert-precision] [--min-loss-scale MIN_LOSS_SCALE]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                [--amp-batch-retries AMP_BATCH_RETRIES]
                [--amp-init-scale AMP_INIT_SCALE]
                [--amp-scale-window AMP_SCALE_WINDOW] [--user-dir USER_DIR]
                [--empty-cache-freq EMPTY_CACHE_FREQ]
                [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                [--model-parallel-size MODEL_PARALLEL_SIZE]
                [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                [--profile] [--reset-logging] [--suppress-crashes]
                [--use-plasma-view] [--plasma-path PLASMA_PATH]
                [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy,scst_reward_criterion,adjust_label_smoothed_cross_entropy,clip_scst_reward_criterion,adjust_label_smoothed_encouraging_loss}]
                [--tokenizer {moses,nltk,space}]
                [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                [--task TASK] [--num-workers NUM_WORKERS]
                [--skip-invalid-size-inputs-valid-test]
                [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                [--data-buffer-size DATA_BUFFER_SIZE]
                [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
                [--combine-valid-subsets] [--ignore-unused-valid-subsets]
                [--validate-interval VALIDATE_INTERVAL]
                [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                [--validate-after-updates VALIDATE_AFTER_UPDATES]
                [--fixed-validation-seed FIXED_VALIDATION_SEED]
                [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
                [--batch-size-valid BATCH_SIZE_VALID]
                [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM]
                [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS]
                [--shard-id SHARD_ID] [--grouped-shuffling]
                [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                [--update-ordered-indices-seed]
                [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn]
                [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                [--ddp-comm-hook {none,fp16}] [--bucket-cap-mb BUCKET_CAP_MB]
                [--fix-batches-to-gpus] [--find-unused-parameters]
                [--gradient-as-bucket-view] [--fast-stat-sync]
                [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers]
                [--slowmo-momentum SLOWMO_MOMENTUM]
                [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                [--localsgd-frequency LOCALSGD_FREQUENCY]
                [--nprocs-per-node NPROCS_PER_NODE]
                [--pipeline-model-parallel]
                [--pipeline-balance PIPELINE_BALANCE]
                [--pipeline-devices PIPELINE_DEVICES]
                [--pipeline-chunks PIPELINE_CHUNKS]
                [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                [--pipeline-checkpoint {always,never,except_last}]
                [--zero-sharding {none,os}] [--no-reshard-after-forward]
                [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
                [--not-fsdp-flatten-parameters] [--arch ARCH]
                [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
                [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR]
                [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                [--skip-remainder-batch] [--save-dir SAVE_DIR]
                [--restore-file RESTORE_FILE] [--continue-once CONTINUE_ONCE]
                [--finetune-from-model FINETUNE_FROM_MODEL]
                [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters]
                [--reset-optimizer]
                [--optimizer-overrides OPTIMIZER_OVERRIDES]
                [--save-interval SAVE_INTERVAL]
                [--save-interval-updates SAVE_INTERVAL_UPDATES]
                [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                [--keep-last-epochs KEEP_LAST_EPOCHS]
                [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save]
                [--no-epoch-checkpoints] [--no-last-checkpoints]
                [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                [--checkpoint-suffix CHECKPOINT_SUFFIX]
                [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                [--load-checkpoint-on-all-dp-ranks]
                [--write-checkpoints-asynchronously] [--store-ema]
                [--ema-decay EMA_DECAY] [--ema-start-update EMA_START_UPDATE]
                [--ema-seed-model EMA_SEED_MODEL]
                [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--dropout D] [--attention-dropout D] [--activation-dropout D]
                [--encoder-embed-path STR] [--encoder-embed-dim N]
                [--encoder-ffn-embed-dim N] [--encoder-layers N]
                [--encoder-attention-heads N] [--encoder-normalize-before]
                [--encoder-learned-pos] [--decoder-embed-path STR]
                [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
                [--decoder-layers N] [--decoder-attention-heads N]
                [--decoder-learned-pos] [--decoder-normalize-before]
                [--decoder-output-dim N] [--share-decoder-input-output-embed]
                [--share-all-embeddings] [--no-token-positional-embeddings]
                [--adaptive-softmax-cutoff EXPR]
                [--adaptive-softmax-dropout D] [--layernorm-embedding]
                [--no-scale-embedding] [--checkpoint-activations]
                [--offload-activations] [--no-cross-attention]
                [--cross-self-attention] [--encoder-layerdrop D]
                [--decoder-layerdrop D]
                [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
                [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
                [--quant-noise-pq D] [--quant-noise-pq-block-size D]
                [--quant-noise-scalar D] [--min-params-to-wrap D]
                [--resnet-drop-path-rate RESNET_DROP_PATH_RATE]
                [--encoder-drop-path-rate ENCODER_DROP_PATH_RATE]
                [--decoder-drop-path-rate DECODER_DROP_PATH_RATE]
                [--token-bucket-size TOKEN_BUCKET_SIZE]
                [--image-bucket-size IMAGE_BUCKET_SIZE]
                [--attn-scale-factor ATTN_SCALE_FACTOR] [--freeze-resnet]
                [--freeze-encoder-embedding] [--freeze-decoder-embedding]
                [--add-type-embedding]
                [--resnet-type {resnet50,resnet101,resnet152}]
                [--resnet-model-path STR] [--code-image-size CODE_IMAGE_SIZE]
                [--patch-layernorm-embedding] [--code-layernorm-embedding]
                [--entangle-position-embedding] [--disable-entangle]
                [--sync-bn] [--scale-attn] [--scale-fc] [--scale-heads]
                [--scale-resids] [--pooler-dropout D]
                [--pooler-classifier {mlp,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--spectral-norm-classification-head]
                [--selected-cols SELECTED_COLS] [--bpe-dir BPE_DIR]
                [--max-source-positions MAX_SOURCE_POSITIONS]
                [--max-target-positions MAX_TARGET_POSITIONS]
                [--max-src-length MAX_SRC_LENGTH]
                [--max-tgt-length MAX_TGT_LENGTH]
                [--code-dict-size CODE_DICT_SIZE]
                [--patch-image-size PATCH_IMAGE_SIZE] [--num-bins NUM_BINS]
                [--imagenet-default-mean-and-std]
                [--constraint-range CONSTRAINT_RANGE] [--eval-bleu]
                [--eval-cider] [--eval-args EVAL_ARGS] [--eval-print-samples]
                [--eval-cider-cached-tokens EVAL_CIDER_CACHED_TOKENS] [--scst]
                [--scst-args SCST_ARGS] [--label-smoothing LABEL_SMOOTHING]
                [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE]
                [--ignore-eos] [--drop-worst-ratio DROP_WORST_RATIO]
                [--drop-worst-after DROP_WORST_AFTER] [--use-rdrop]
                [--reg-alpha REG_ALPHA] [--sample-patch-num SAMPLE_PATCH_NUM]
                [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
                [--weight-decay WEIGHT_DECAY] [--use-old-adam]
                [--fp16-adam-stats] [--warmup-updates WARMUP_UPDATES]
                [--force-anneal FORCE_ANNEAL]
                [--end-learning-rate END_LEARNING_RATE] [--power POWER]
                [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
                [--unk UNK]
                data
train.py: error: unrecognized arguments: --warmup-ratio=0.06
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 31799) of binary: /home/XXXX/.conda/envs/ofa/bin/python
Traceback (most recent call last):
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 31800)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 31801)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 31802)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 31799)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@AI-EnabledSoftwareEngineering-AISE
Copy link
Author

AI-EnabledSoftwareEngineering-AISE commented May 1, 2022

Also, this is my setup to just test your training.

export MASTER_PORT=1051
log_dir=./stage1_logs
save_dir=./stage1_checkpoints
mkdir -p $log_dir $save_dir

bpe_dir=../../utils/BPE
user_dir=../../ofa_module

# data_dir=../../dataset/caption_data
data_dir=/raid/AISSEL/YYYY/datasets/caption_data
data=${data_dir}/caption_stage1_train.tsv,${data_dir}/caption_val.tsv
restore_file=../../checkpoints/ofa_large.pt
selected_cols=0,4,2

task=caption
arch=ofa_large
criterion=adjust_label_smoothed_cross_entropy
label_smoothing=0.1
lr=1e-5
max_epoch=2
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.1
decoder_drop_path_rate=0.1
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
drop_worst_after=2500
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
drop_worst_ratio=0.2

CUDA_VISIBLE_DEVICES=0,1 ~/.conda/envs/ofa/bin/python -m torch.distributed.launch --nproc_per_node=4 --master_port=${MASTER_PORT} ../../train.py \
          $data \
          --selected-cols=${selected_cols} \
          --bpe-dir=${bpe_dir} \
          --user-dir=${user_dir} \
          --restore-file=${restore_file} \
          --reset-optimizer --reset-dataloader --reset-meters \
          --save-dir=${save_path} \
          --task=${task} \
          --arch=${arch} \
          --criterion=${criterion} \
          --label-smoothing=${label_smoothing} \
          --batch-size=${batch_size} \
          --update-freq=${update_freq} \
          --encoder-normalize-before \
          --decoder-normalize-before \
          --share-decoder-input-output-embed \
          --share-all-embeddings \
          --layernorm-embedding \
          --patch-layernorm-embedding \
          --code-layernorm-embedding \
          --resnet-drop-path-rate=${resnet_drop_path_rate} \
          --encoder-drop-path-rate=${encoder_drop_path_rate} \
          --decoder-drop-path-rate=${decoder_drop_path_rate} \
          --dropout=${dropout} \
          --attention-dropout=${attention_dropout} \
          --weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
          --lr-scheduler=polynomial_decay --lr=${lr} \
          --max-epoch=${max_epoch} --warmup-ratio=${warmup_ratio} \
          --log-format=simple --log-interval=10 \
          --fixed-validation-seed=7 \
          --no-epoch-checkpoints --keep-best-checkpoints=1 \
          --save-interval=1 --validate-interval=1 \
          --save-interval-updates=500 --validate-interval-updates=500 \
          --eval-cider \
          --eval-cider-cached-tokens=${eval_cider_cached} \
          --eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
          --best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
          --max-src-length=${max_src_length} \
          --max-tgt-length=${max_tgt_length} \
          --find-unused-parameters \
          --freeze-encoder-embedding \
          --freeze-decoder-embedding \
          --add-type-embedding \
          --scale-attn \
          --scale-fc \
          --scale-heads \
          --disable-entangle \
          --num-bins=${num_bins} \
          --patch-image-size=${patch_image_size} \
          --drop-worst-ratio=${drop_worst_ratio} \
          --drop-worst-after=${drop_worst_after} \
          --fp16 \
          --fp16-scale-window=512 \
          --num-workers=0 #> ${log_file} 2>&1

@logicwong
Copy link
Member

@AI-EnabledSoftwareEngineering-AISE Hi, did you use the fairseq library from our repo? The argument warmup-ratio is in OFA/fairseq/fairseq/optim/lr_scheduler/polynomial_decay_schedule.py

@AI-EnabledSoftwareEngineering-AISE
Copy link
Author

Thank you, it was becuse of that I used fairseq from other source.

@zh-zhang1984
Copy link

Hi,Can anyone give some suggestions to solve this issue? how can I shift to use the fairseq library from current repo?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants