
Inconsistent Performance and Loss when Resuming Training #13

Closed
nuaazs opened this issue Jul 25, 2023 · 4 comments

nuaazs commented Jul 25, 2023

Thank you for your excellent work. 🙂

We have observed that whenever we resume training with a different total number of epochs after the original run has completed, the loaded checkpoint exhibits significantly lower accuracy than it had at the corresponding epoch of the original training. For instance, a checkpoint trained for 100 epochs performs only on par with one trained for about 30 epochs.

This inconsistency in performance after resuming training poses a challenge for us to continue training from a checkpoint and obtain the desired results.

[two screenshots attached]


nuaazs commented Jul 25, 2023

train.log


nuaazs commented Jul 25, 2023

config.yaml

aug_prob: 0.2
augmentations:
  args:
    aug_prob: <aug_prob>
    noise_file: <noise>
    reverb_file: <reverb>
  obj: speakerlab.process.processor.SpkVeriAug
batch_size: 256
checkpointer:
  args:
    checkpoints_dir: <exp_dir>/models
    recoverables:
      classifier: <classifier>
      embedding_model: <embedding_model>
      epoch_counter: <epoch_counter>
  obj: speakerlab.utils.checkpoint.Checkpointer
classifier:
  args:
    input_dim: <embedding_size>
    out_neurons: <num_classes>
  obj: speakerlab.models.campplus.classifier.CosineClassifier
data: data/vox2_dev/train.csv
dataloader:
  args:
    batch_size: <batch_size>
    dataset: <dataset>
    drop_last: true
    num_workers: <num_workers>
    pin_memory: true
  obj: torch.utils.data.DataLoader
dataset:
  args:
    data_file: <data>
    preprocessor: <preprocessor>
  obj: speakerlab.dataset.dataset.WavSVDataset
embedding_model:
  args:
    embed_dim: <embedding_size>
    feat_dim: <fbank_dim>
    num_blocks:
    - 3
    - 3
    - 9
    - 3
    pooling_func: GSP
  obj: speakerlab.models.dfresnet.resnet.DFResNet
embedding_size: 512
epoch_counter:
  args:
    limit: <num_epoch>
  obj: speakerlab.utils.epoch.EpochCounter
exp_dir: exp/dfresnet56
fbank_dim: 80
feature_extractor:
  args:
    mean_nor: true
    n_mels: <fbank_dim>
    sample_rate: <sample_rate>
  obj: speakerlab.process.processor.FBank
label_encoder:
  args:
    data_file: <data>
  obj: speakerlab.process.processor.SpkLabelEncoder
log_batch_freq: 100
loss:
  args:
    easy_margin: false
    margin: 0.2
    scale: 32.0
  obj: speakerlab.loss.margin_loss.ArcMarginLoss
lr: 0.1
lr_scheduler:
  args:
    fix_epoch: <num_epoch>
    max_lr: <lr>
    min_lr: <min_lr>
    optimizer: <optimizer>
    step_per_epoch: null
    warmup_epoch: 5
  obj: speakerlab.process.scheduler.WarmupCosineScheduler
margin_scheduler:
  args:
    criterion: <loss>
    final_margin: 0.2
    fix_epoch: 25
    increase_start_epoch: 15
    initial_margin: 0.0
    step_per_epoch: null
  obj: speakerlab.process.scheduler.MarginScheduler
min_lr: 0.0001
noise: data/musan/wav.scp
num_classes: 5994
num_epoch: 200
num_workers: 16
optimizer:
  args:
    lr: <lr>
    momentum: 0.9
    nesterov: true
    params: null
    weight_decay: 0.0001
  obj: torch.optim.SGD
preprocessor:
  augmentations: <augmentations>
  feature_extractor: <feature_extractor>
  label_encoder: <label_encoder>
  wav_reader: <wav_reader>
reverb: data/rirs/wav.scp
sample_rate: 16000
save_epoch_freq: 2
speed_pertub: true
wav_len: 3.0
wav_reader:
  args:
    duration: <wav_len>
    sample_rate: <sample_rate>
    speed_pertub: <speed_pertub>
  obj: speakerlab.process.processor.WavReader
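
(Note: the '<key>' placeholders in this config refer to other top-level keys in the same file; in particular, 'fix_epoch' under 'lr_scheduler' is bound to 'num_epoch', so changing 'num_epoch' also changes the shape of the learning-rate schedule. The sketch below shows one way such placeholders could be resolved; the actual speakerlab config builder is not part of this thread, so the helper name 'resolve_placeholders' and its logic are illustrative assumptions only.)

```python
# Hypothetical sketch of resolving the '<key>' references above; speakerlab's
# real config loader is not shown in this thread, so this is an assumption.
import re
import yaml

def resolve_placeholders(node, root):
    """Recursively replace whole-string '<key>' values with the top-level entry
    they name (partial placeholders like '<exp_dir>/models' are left as-is)."""
    if isinstance(node, dict):
        return {k: resolve_placeholders(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_placeholders(v, root) for v in node]
    if isinstance(node, str):
        match = re.fullmatch(r"<(\w+)>", node)
        if match and match.group(1) in root:
            return resolve_placeholders(root[match.group(1)], root)
    return node

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)
cfg = resolve_placeholders(cfg, cfg)

# lr_scheduler.args.fix_epoch now equals num_epoch (200), which is why raising
# num_epoch stretches the warmup-cosine schedule over the longer run.
print(cfg["lr_scheduler"]["args"]["fix_epoch"])  # 200
```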

wanghuii1 (Collaborator) commented

@nuaazs The 'lr_scheduler' is a warmup cosine scheduler whose 'fix_epoch' is bound to 'num_epoch'. If you only change 'num_epoch' to 200 and then resume training, the cosine curve is stretched over 200 epochs, so the learning rate at epoch 100 jumps back up instead of staying near 'min_lr'. To avoid this, I recommend adjusting the 'lr_scheduler' configuration so the resumed run keeps a lower learning rate. Alternatively, you can simply keep training for more epochs and performance will recover as the schedule decays again.
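
For illustration, here is a minimal sketch of a generic warmup + cosine schedule (not the actual speakerlab.process.scheduler.WarmupCosineScheduler, whose code is not shown in this thread) that makes the effect concrete: once 'fix_epoch' is recomputed as 200, the learning rate at epoch 100 sits mid-decay instead of near 'min_lr'.

```python
import math

# Minimal sketch of a warmup + cosine decay, assumed shape only; the real
# speakerlab scheduler may differ in details (per-step updates, etc.).
def warmup_cosine_lr(epoch, max_lr=0.1, min_lr=1e-4, warmup_epoch=5, fix_epoch=100):
    if epoch < warmup_epoch:                      # linear warmup phase
        return max_lr * epoch / warmup_epoch
    progress = (epoch - warmup_epoch) / (fix_epoch - warmup_epoch)
    progress = min(progress, 1.0)                 # clamp once the schedule is finished
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(warmup_cosine_lr(100, fix_epoch=100))  # ~1e-4: end of the original 100-epoch run
print(warmup_cosine_lr(100, fix_epoch=200))  # ~0.05: resumed run jumps back to a high lr
```

Under this assumed shape, resuming at epoch 100 with the stretched schedule effectively restarts training at a large learning rate, which would explain the accuracy drop reported above until the cosine decays again.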


nuaazs commented Jul 25, 2023

Thank you for your response, @wanghuii1.
It seems the poor results were indeed caused by the learning rate being set too high after resuming.
Everything works fine now that I have adjusted the learning rate configuration.

@nuaazs nuaazs closed this as completed Jul 25, 2023