# Other Problems and Some Solutions

## About Training Configs

1. `MODEL_DIR`:
   - The output logs, evaluation results, and checkpoints produced during training are saved to the `MODEL_DIR` path set in `trainval_distributed.py` for each dataset.
   - It can be customized to any location. Experiments are separated into folders named by their start date-time (see the first sketch after this list).
2. GPU and Batch-Size:
   - In each `dist_train.sh` file, `CUDA_VISIBLE_DEVICES` defines the GPU IDs visible to the process.
   - `nproc_per_node` is the number of GPUs actually used by the PyTorch DDP processes.
   - NOTE: `trainval_distributed.py` uses the length of `CUDA_VISIBLE_DEVICES` as the GPU count for DDP, rather than `nproc_per_node` (see the second sketch after this list).
   - If `nproc_per_node=1` and the length of `CUDA_VISIBLE_DEVICES` is 1, single-GPU training is used.
   - `config.onegpu` is the batch size on each GPU, not the total batch size.
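A minimal sketch of the experiment layout described in item 1, assuming `MODEL_DIR` is a plain path and the run folder is named by the start date-time (the example path and date format are assumptions, not the repository's exact code):

```python
import os
from datetime import datetime

# Minimal sketch: one subfolder per run under MODEL_DIR, named by
# the start date-time. Path and format string are assumed values.
MODEL_DIR = "./output/CityPersons"
run_dir = os.path.join(MODEL_DIR, datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
os.makedirs(run_dir, exist_ok=True)
# Logs, evaluation results, and checkpoints for this run all go into run_dir.
```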
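And a sketch of the NOTE in item 2: the DDP GPU count comes from `CUDA_VISIBLE_DEVICES`, not from `nproc_per_node`, and `config.onegpu` scales with it to give the total batch size (parsing details and the per-GPU batch value are assumptions):

```python
import os

# Sketch: derive the GPU count from CUDA_VISIBLE_DEVICES, as the note
# above describes, rather than from nproc_per_node.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "0")  # e.g. "0,1"
num_gpus = len(visible.split(","))                     # "0,1" -> 2 GPUs

onegpu = 8                            # config.onegpu: per-GPU batch size (assumed value)
total_batch_size = onegpu * num_gpus  # total batch size across all GPUs
print(f"GPUs: {num_gpus}, total batch size: {total_batch_size}")
```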

## About Reproducibility

### Problems

1. Training Phase:
   - Reproducibility can only be guaranteed in the same environment, i.e., with the hardware, the software, and all random seeds fixed (see the sketch after this list).
   - Different versions of hardware (e.g., GPUs and CPUs) or software (e.g., CUDA) cause such discrepancies. Please refer to this issue of PyTorch.
   - For example, the model may always achieve a better result A on Machine 1 but a worse result B on Machine 2.
   - After various trials, we found that 2-GPU training on CityPersons is more sensitive to environment changes than 1-GPU training on Caltech.
2. Evaluation Phase
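For reference, a minimal sketch of what fixing "all random seeds" typically involves in a PyTorch project; the exact set of calls in `trainval_distributed.py` may differ:

```python
import random

import numpy as np
import torch

def fix_seeds(seed: int) -> None:
    """Fix the common sources of randomness (a generic sketch, not
    necessarily the exact procedure used by this repository)."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # PyTorch RNGs on all GPUs
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
```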

### Solutions

1. Training Phase:
   - The same random seed does not always work across machines; try another one.
   - Set `self.gen_seed=True` in the config, and a new seed will be printed in the top lines of the training log files.
   - If one seed works, fix it in the config: e.g., if 1763 is printed, set `self.seed=1763` and `self.gen_seed=False` (see the first sketch after this list).
   - For example, to rule out performance gains caused by non-method changes, VLPD was trained on the same machine with the fixed seed 1337 (training logs can be downloaded from BaiduYun or GoogleDrive).
2. Evaluation Phase:
   - To avoid adjusting `val_begin`, save the checkpoints you want by adjusting `save_begin` and `save_end` (see the second sketch after this list).
   - Then evaluate them offline as described in Evaluation.md, instead of during training.
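A hypothetical sketch of the seed workflow in item 1. The attribute names `gen_seed` and `seed` come from the config described above; the surrounding logic is an assumption for illustration:

```python
import random

class Config:
    def __init__(self):
        self.gen_seed = False  # set True to draw and print a fresh seed
        self.seed = 1763       # once a seed works, fix it here (example value from above)

def resolve_seed(config: Config) -> int:
    """Pick the seed for this run (assumed logic, for illustration only)."""
    if config.gen_seed:
        # Draw a new random seed and print it, so it can be copied back
        # into the config once it proves to work on this machine.
        seed = random.randint(0, 2**31 - 1)
        print(f"Generated seed: {seed}")  # appears in the top lines of the log
    else:
        seed = config.seed
    return seed
```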
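And a hypothetical sketch of the checkpoint-saving window in item 2. `save_begin` and `save_end` are the config names from above; the epoch loop, window values, and model are stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)          # stand-in for the real detector
save_begin, save_end = 80, 120   # example window (assumed values)

for epoch in range(1, 121):
    # ... one epoch of training ...
    if save_begin <= epoch <= save_end:
        # Checkpoints inside the window are kept for offline evaluation
        # (as in Evaluation.md), so "val_begin" never needs tuning.
        torch.save(model.state_dict(), f"ckpt_epoch_{epoch}.pth")
```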

← Go back to README.md