
Inconsistent Performance and Loss when Resuming Training #13

Closed
nuaazs opened this issue Jul 25, 2023 · 4 comments

nuaazs commented Jul 25, 2023

Thank you for your excellent work. 🙂

We have observed that whenever we resume training with a different total number of epochs after the original run has completed, the loaded checkpoint exhibits significantly lower accuracy than it had at the corresponding epoch of the original training. For instance, a checkpoint trained for 100 epochs performs only on par with one trained for about 30 epochs.

This inconsistency in performance after resuming training poses a challenge for us to continue training from a checkpoint and obtain the desired results.

[two screenshots attached]


nuaazs commented Jul 25, 2023

train.log


nuaazs commented Jul 25, 2023

config.yaml

aug_prob: 0.2
augmentations:
  args:
    aug_prob: <aug_prob>
    noise_file: <noise>
    reverb_file: <reverb>
  obj: speakerlab.process.processor.SpkVeriAug
batch_size: 256
checkpointer:
  args:
    checkpoints_dir: <exp_dir>/models
    recoverables:
      classifier: <classifier>
      embedding_model: <embedding_model>
      epoch_counter: <epoch_counter>
  obj: speakerlab.utils.checkpoint.Checkpointer
classifier:
  args:
    input_dim: <embedding_size>
    out_neurons: <num_classes>
  obj: speakerlab.models.campplus.classifier.CosineClassifier
data: data/vox2_dev/train.csv
dataloader:
  args:
    batch_size: <batch_size>
    dataset: <dataset>
    drop_last: true
    num_workers: <num_workers>
    pin_memory: true
  obj: torch.utils.data.DataLoader
dataset:
  args:
    data_file: <data>
    preprocessor: <preprocessor>
  obj: speakerlab.dataset.dataset.WavSVDataset
embedding_model:
  args:
    embed_dim: <embedding_size>
    feat_dim: <fbank_dim>
    num_blocks:
    - 3
    - 3
    - 9
    - 3
    pooling_func: GSP
  obj: speakerlab.models.dfresnet.resnet.DFResNet
embedding_size: 512
epoch_counter:
  args:
    limit: <num_epoch>
  obj: speakerlab.utils.epoch.EpochCounter
exp_dir: exp/dfresnet56
fbank_dim: 80
feature_extractor:
  args:
    mean_nor: true
    n_mels: <fbank_dim>
    sample_rate: <sample_rate>
  obj: speakerlab.process.processor.FBank
label_encoder:
  args:
    data_file: <data>
  obj: speakerlab.process.processor.SpkLabelEncoder
log_batch_freq: 100
loss:
  args:
    easy_margin: false
    margin: 0.2
    scale: 32.0
  obj: speakerlab.loss.margin_loss.ArcMarginLoss
lr: 0.1
lr_scheduler:
  args:
    fix_epoch: <num_epoch>
    max_lr: <lr>
    min_lr: <min_lr>
    optimizer: <optimizer>
    step_per_epoch: null
    warmup_epoch: 5
  obj: speakerlab.process.scheduler.WarmupCosineScheduler
margin_scheduler:
  args:
    criterion: <loss>
    final_margin: 0.2
    fix_epoch: 25
    increase_start_epoch: 15
    initial_margin: 0.0
    step_per_epoch: null
  obj: speakerlab.process.scheduler.MarginScheduler
min_lr: 0.0001
noise: data/musan/wav.scp
num_classes: 5994
num_epoch: 200
num_workers: 16
optimizer:
  args:
    lr: <lr>
    momentum: 0.9
    nesterov: true
    params: null
    weight_decay: 0.0001
  obj: torch.optim.SGD
preprocessor:
  augmentations: <augmentations>
  feature_extractor: <feature_extractor>
  label_encoder: <label_encoder>
  wav_reader: <wav_reader>
reverb: data/rirs/wav.scp
sample_rate: 16000
save_epoch_freq: 2
speed_pertub: true
wav_len: 3.0
wav_reader:
  args:
    duration: <wav_len>
    sample_rate: <sample_rate>
    speed_pertub: <speed_pertub>
  obj: speakerlab.process.processor.WavReader
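
(Note: the '<key>' placeholders in this config refer to other top-level keys in the same file; in particular, 'fix_epoch' under 'lr_scheduler' is bound to 'num_epoch', so changing 'num_epoch' also changes the shape of the learning-rate schedule. The sketch below shows one way such placeholders could be resolved; the actual speakerlab config builder is not part of this thread, so the helper name 'resolve_placeholders' and its logic are illustrative assumptions only.)

```python
# Hypothetical sketch of resolving the '<key>' references above; speakerlab's
# real config loader is not shown in this thread, so this is an assumption.
import re
import yaml

def resolve_placeholders(node, root):
    """Recursively replace whole-string '<key>' values with the top-level entry
    they name (partial placeholders like '<exp_dir>/models' are left as-is)."""
    if isinstance(node, dict):
        return {k: resolve_placeholders(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_placeholders(v, root) for v in node]
    if isinstance(node, str):
        match = re.fullmatch(r"<(\w+)>", node)
        if match and match.group(1) in root:
            return resolve_placeholders(root[match.group(1)], root)
    return node

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)
cfg = resolve_placeholders(cfg, cfg)

# lr_scheduler.args.fix_epoch now equals num_epoch (200), which is why raising
# num_epoch stretches the warmup-cosine schedule over the longer run.
print(cfg["lr_scheduler"]["args"]["fix_epoch"])  # 200
```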

wanghuii1 (Collaborator) commented

@nuaazs The 'lr_scheduler' is a warmup cosine scheduler whose 'fix_epoch' is bound to 'num_epoch'. If you only change 'num_epoch' to 200 and then resume training, the cosine curve is stretched over 200 epochs, so the learning rate at epoch 100 jumps back up instead of staying near 'min_lr'. To avoid this, I recommend adjusting the 'lr_scheduler' configuration so the resumed run keeps a lower learning rate. Alternatively, you can simply keep training for more epochs and performance will recover as the schedule decays again.
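
For illustration, here is a minimal sketch of a generic warmup + cosine schedule (not the actual speakerlab.process.scheduler.WarmupCosineScheduler, whose code is not shown in this thread) that makes the effect concrete: once 'fix_epoch' is recomputed as 200, the learning rate at epoch 100 sits mid-decay instead of near 'min_lr'.

```python
import math

# Minimal sketch of a warmup + cosine decay, assumed shape only; the real
# speakerlab scheduler may differ in details (per-step updates, etc.).
def warmup_cosine_lr(epoch, max_lr=0.1, min_lr=1e-4, warmup_epoch=5, fix_epoch=100):
    if epoch < warmup_epoch:                      # linear warmup phase
        return max_lr * epoch / warmup_epoch
    progress = (epoch - warmup_epoch) / (fix_epoch - warmup_epoch)
    progress = min(progress, 1.0)                 # clamp once the schedule is finished
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(warmup_cosine_lr(100, fix_epoch=100))  # ~1e-4: end of the original 100-epoch run
print(warmup_cosine_lr(100, fix_epoch=200))  # ~0.05: resumed run jumps back to a high lr
```

Under this assumed shape, resuming at epoch 100 with the stretched schedule effectively restarts training at a large learning rate, which would explain the accuracy drop reported above until the cosine decays again.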


nuaazs commented Jul 25, 2023

Thank you for your response, @wanghuii1.
It seems the poor results were indeed caused by the learning rate being set too high after resuming.
Everything works fine now that I have adjusted the learning rate configuration.

@nuaazs nuaazs closed this as completed Jul 25, 2023