Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAN loss #4

Closed
tzt101 opened this issue Jun 4, 2021 · 4 comments
Closed

NAN loss #4

tzt101 opened this issue Jun 4, 2021 · 4 comments

Comments

@tzt101
Copy link

tzt101 commented Jun 4, 2021

Hi, I just trained cvt13-224 model with the default settings, but got NAN loss after several epochs.
Does anyone have trained this model sucessfully?
图片

@leoxiaobin
Copy link

hi, @tzt101,

Could you paste the printed configuration for your job?

@tzt101
Copy link
Author

tzt101 commented Jun 5, 2021

This is the configuration, I just keep the default settings.
AMP:
ENABLED: true
MEMORY_FORMAT: nchw
AUG:
COLOR_JITTER:

  • 0.4
  • 0.4
  • 0.4
  • 0.1
  • 0.0
    DROPBLOCK_BLOCK_SIZE: 7
    DROPBLOCK_KEEP_PROB: 1.0
    DROPBLOCK_LAYERS:
  • 3
  • 4
    GAUSSIAN_BLUR: 0.0
    GRAY_SCALE: 0.0
    INTERPOLATION: 2
    MIXCUT: 1.0
    MIXCUT_AND_MIXUP: false
    MIXCUT_MINMAX: []
    MIXUP: 0.8
    MIXUP_MODE: batch
    MIXUP_PROB: 1.0
    MIXUP_SWITCH_PROB: 0.5
    RATIO:
  • 0.75
  • 1.3333333333333333
    SCALE:
  • 0.08
  • 1.0
    TIMM_AUG:
    AUTO_AUGMENT: rand-m9-mstd0.5-inc1
    COLOR_JITTER: 0.4
    HFLIP: 0.5
    INTERPOLATION: bicubic
    RE_COUNT: 1
    RE_MODE: pixel
    RE_PROB: 0.25
    RE_SPLIT: false
    USE_LOADER: true
    USE_TRANSFORM: false
    VFLIP: 0.0
    BASE:
  • ''
    CUDNN:
    BENCHMARK: true
    DETERMINISTIC: false
    ENABLED: true
    DATASET:
    DATASET: imagenet
    DATA_FORMAT: jpg
    LABELMAP: ''
    ROOT: /home/tzt/dataset/imagenet/
    SAMPLER: default
    TARGET_SIZE: -1
    TEST_SET: val
    TEST_TSV_LIST: []
    TRAIN_SET: train
    TRAIN_TSV_LIST: []
    DATA_DIR: ''
    DEBUG:
    DEBUG: false
    DIST_BACKEND: nccl
    FINETUNE:
    BASE_LR: 0.003
    BATCH_SIZE: 512
    EVAL_EVERY: 3000
    FINETUNE: false
    FROZEN_LAYERS: []
    LR_SCHEDULER:
    DECAY_TYPE: step
    TRAIN_MODE: true
    USE_TRAIN_AUG: false
    GPUS:
  • 0
    INPUT:
    MEAN:
    • 0.485
    • 0.456
    • 0.406
      STD:
    • 0.229
    • 0.224
    • 0.225
      LOSS:
      LABEL_SMOOTHING: 0.1
      LOSS: softmax
      MODEL:
      INIT_WEIGHTS: true
      NAME: cls_cvt
      NUM_CLASSES: 1000
      PRETRAINED: ''
      PRETRAINED_LAYERS:
    • '*'
      SPEC:
      ATTN_DROP_RATE:
      • 0.0
      • 0.0
      • 0.0
        CLS_TOKEN:
      • false
      • false
      • true
        DEPTH:
      • 1
      • 2
      • 10
        DIM_EMBED:
      • 64
      • 192
      • 384
        DROP_PATH_RATE:
      • 0.0
      • 0.0
      • 0.1
        DROP_RATE:
      • 0.0
      • 0.0
      • 0.0
        INIT: trunc_norm
        KERNEL_QKV:
      • 3
      • 3
      • 3
        MLP_RATIO:
      • 4.0
      • 4.0
      • 4.0
        NUM_HEADS:
      • 1
      • 3
      • 6
        NUM_STAGES: 3
        PADDING_KV:
      • 1
      • 1
      • 1
        PADDING_Q:
      • 1
      • 1
      • 1
        PATCH_PADDING:
      • 2
      • 1
      • 1
        PATCH_SIZE:
      • 7
      • 3
      • 3
        PATCH_STRIDE:
      • 4
      • 2
      • 2
        POS_EMBED:
      • false
      • false
      • false
        QKV_BIAS:
      • true
      • true
      • true
        QKV_PROJ_METHOD:
      • dw_bn
      • dw_bn
      • dw_bn
        STRIDE_KV:
      • 2
      • 2
      • 2
        STRIDE_Q:
      • 1
      • 1
      • 1
        MODEL_SUMMARY: false
        MULTIPROCESSING_DISTRIBUTED: true
        NAME: cvt-13-224x224
        OUTPUT_DIR: OUTPUT/
        PIN_MEMORY: true
        PRINT_FREQ: 500
        RANK: 0
        TEST:
        BATCH_SIZE_PER_GPU: 32
        CENTER_CROP: true
        IMAGE_SIZE:
    • 224
    • 224
      INTERPOLATION: 3
      MODEL_FILE: ''
      REAL_LABELS: false
      VALID_LABELS: ''
      TRAIN:
      AUTO_RESUME: true
      BATCH_SIZE_PER_GPU: 128
      BEGIN_EPOCH: 0
      CHECKPOINT: ''
      CLIP_GRAD_NORM: 0.0
      DETECT_ANOMALY: false
      END_EPOCH: 300
      EVAL_BEGIN_EPOCH: 0
      GAMMA1: 0.99
      GAMMA2: 0.0
      IMAGE_SIZE:
    • 224
    • 224
      LR: 0.16
      LR_SCHEDULER:
      ARGS:
      cooldown_epochs: 10
      decay_rate: 0.1
      epochs: 300
      min_lr: 1.0e-05
      sched: cosine
      warmup_epochs: 5
      warmup_lr: 1.0e-06
      METHOD: timm
      MOMENTUM: 0.9
      NESTEROV: true
      OPTIMIZER: adamW
      OPTIMIZER_ARGS: {}
      SAVE_ALL_MODELS: false
      SCALE_LR: true
      SHUFFLE: true
      WD: 0.05
      WITHOUT_WD_LIST:
    • bn
    • bias
    • ln
      VERBOSE: true
      WORKERS: 6

@leoxiaobin
Copy link

leoxiaobin commented Jun 5, 2021

it seems that you are using a larger LR.
If you specify BATCH_SIZE_PER_GPU to 128, you should specify LR to 0.000125.
The LR in our config is with respect to BATCH_SIZE_PER_GPU. You are using a much larger LR than our original config. I guess that's the reason you got NaN error.

@tzt101
Copy link
Author

tzt101 commented Jun 5, 2021

Thank you very much! I will try to use small lr later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants