Evaluation
After every 25 completed training steps (this configuration parameter is defined in the training script), the following list of metrics is displayed in the terminal:
```
--> TIME: 2024-06-22 16:58:45 -- STEP: 1076/1233 -- GLOBAL_STEP: 35600
| > loss_disc: 2.601724863052368 (2.5608400611629265)
| > loss_disc_real_0: 0.16440898180007935 (0.18418801065575882)
| > loss_disc_real_1: 0.18971821665763855 (0.2082644184587391)
| > loss_disc_real_2: 0.22247859835624695 (0.22729541624456534)
| > loss_disc_real_3: 0.21355962753295898 (0.23436040888266937)
| > loss_disc_real_4: 0.2299477607011795 (0.23665879754051844)
| > loss_disc_real_5: 0.2520277202129364 (0.2239697749508358)
| > loss_0: 2.601724863052368 (2.5608400611629265)
| > grad_norm_0: tensor(11.8741, device='cuda:0') (tensor(20.7040, device='cuda:0'))
| > loss_gen: 2.253811836242676 (2.2022835016250597)
| > loss_kl: 1.4479506015777588 (1.5328740684737954)
| > loss_feat: 4.815017223358154 (4.963463480144629)
| > loss_mel: 20.43062973022461 (21.177591887548495)
| > loss_duration: 1.4312173128128052 (1.4145199033850628)
| > amp_scaler: 512.0 (567.1970260223051)
| > loss_1: 30.37862777709961 (31.29073280738632)
| > grad_norm_1: tensor(100.4473, device='cuda:0') (tensor(144.6298, device='cuda:0'))
| > current_lr_0: 0.0001993011799713115
| > current_lr_1: 0.0001993011799713115
| > step_time: 10.9037 (6.006802409998102)
| > loader_time: 0.5829 (0.5791234892540256)
```
These metrics are the training statistics. For each metric, the first number is the value at the current step and the number in parentheses is the running average over the current epoch.
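The 25-step reporting interval is set through the training configuration. Below is a minimal sketch of how such a configuration might look, assuming the training script is built on Coqui TTS's VITS recipe; the field names come from the Coqui trainer configuration, while the output path and batch sizes are placeholders that simply mirror the log above rather than values taken from this repository.

```python
# Minimal sketch, assuming a Coqui TTS VITS recipe; the output path is a
# placeholder and the batch sizes simply mirror the log shown above.
from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    batch_size=16,        # 16 samples per training batch
    eval_batch_size=16,   # 16 samples per evaluation batch
    print_step=25,        # print the training statistics every 25 steps
    run_eval=True,        # run an evaluation at the end of every epoch
    output_path="output/vits_run",  # hypothetical run folder
)
```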
After step 1,233 a new epoch begins, but first the model of the finished epoch is evaluated. The evaluation runs in 25 steps (398 evaluation samples / 16 samples per batch, rounded up, gives 25 steps), and the following metrics are displayed in the terminal:
```
--> STEP: 0
| > loss_disc: 2.496549606323242 (2.496549606323242)
| > loss_disc_real_0: 0.1999395489692688 (0.1999395489692688)
| > loss_disc_real_1: 0.21423761546611786 (0.21423761546611786)
| > loss_disc_real_2: 0.21206657588481903 (0.21206657588481903)
| > loss_disc_real_3: 0.21744751930236816 (0.21744751930236816)
| > loss_disc_real_4: 0.1928769201040268 (0.1928769201040268)
| > loss_disc_real_5: 0.23872263729572296 (0.23872263729572296)
| > loss_0: 2.496549606323242 (2.496549606323242)
| > loss_gen: 2.250887393951416 (2.250887393951416)
| > loss_kl: 1.8777085542678833 (1.8777085542678833)
| > loss_feat: 5.428119659423828 (5.428119659423828)
| > loss_mel: 22.569669723510742 (22.569669723510742)
| > loss_duration: 1.5732102394104004 (1.5732102394104004)
| > loss_1: 33.6995964050293 (33.6995964050293)
```
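The number of evaluation steps follows directly from the size of the evaluation split and the evaluation batch size; the short sketch below only reproduces that arithmetic with the figures quoted above.

```python
import math

eval_samples = 398     # size of the evaluation split quoted above
eval_batch_size = 16   # samples per evaluation batch

eval_steps = math.ceil(eval_samples / eval_batch_size)
print(eval_steps)      # 25
```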
The metrics used to evaluate the progress and the performance of the training are the following:
- loss_0 and loss_1
- loss_disc
- loss_disc_real_0 up to loss_disc_real_5
- grad_norm_0 and grad_norm_1
- loss_gen
- loss_kl
- loss_feat
- loss_mel
- loss_duration
- amp_scaler
- current_lr_0 and current_lr_1
- step_time
- loader_time
After each epoch, the metrics are compared with the values of the preceding epoch. Better values are shown in green, worse values in red.
It is not necessary to compare the individual metric values by hand, because Google provides a valuable tool called TensorBoard to view the training progress in graphical form.
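TensorBoard reads the event files that the trainer writes into the run folder. The usual way to start it is `tensorboard --logdir <run folder>` in a terminal; the sketch below does the same from Python, with the folder name again being a placeholder.

```python
# Sketch: start TensorBoard on the training output folder.
# "output/vits_run" is a placeholder for the actual run directory.
import subprocess

subprocess.run(["tensorboard", "--logdir", "output/vits_run", "--port", "6006"])
```

TensorBoard is then reachable in the browser at http://localhost:6006.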
After the evaluation at the end of the first epoch, the file containing the values of the 83,059,756 model parameters is saved as `best_model.pth`, and a copy of this file is saved as `best_model_1233.pth`. If the model is more accurate at a later evaluation, both files are replaced; the current number of steps then appears in the filename of the copy, for example `best_model_2466.pth`, `best_model_3699.pth`, `best_model_6165.pth`, etc.
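The parameter count can be checked by loading one of the saved files and summing the sizes of the stored tensors. This is only a sketch: the path is a placeholder and the "model" key is an assumption about how the checkpoint dictionary is organised.

```python
# Sketch: count the parameters stored in a saved checkpoint.
# The path is a placeholder and the "model" key is an assumption.
import torch

ckpt = torch.load("best_model.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # fall back to the whole dict if there is no "model" key
n_params = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
print(f"{n_params:,} parameters")  # should come out near 83,059,756
```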
After 10,000 steps (this configuration parameter is defined in the training script), the current model is saved as `checkpoint_10000.pth`. The same is done at step 20,000, step 30,000, and so on.
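This 10,000-step interval corresponds to the save_step field of the trainer configuration. A minimal sketch, under the same Coqui TTS assumption as the earlier configuration example; the run folder used for listing the resulting files is again a placeholder.

```python
# Minimal sketch, assuming a Coqui TTS VitsConfig as in the earlier example.
from pathlib import Path

from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    save_step=10000,  # write checkpoint_10000.pth, checkpoint_20000.pth, ...
)

# List the checkpoints written so far; the run folder name is a placeholder.
for ckpt in sorted(Path("output/vits_run").glob("checkpoint_*.pth")):
    print(ckpt.name)
```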