Configuration Walkthrough
This document explains each configuration option for neosr. Templates can be used for convenience.
- Avoid special characters in both your paths and filenames. Although parsing UTF-8 isn't a problem in most cases, it can still cause problems.
- Prefer full paths over relative paths. Both can be used, but full paths avoid confusion.
- Make sure the slashes in your paths match your system. On Windows, use backslashes (\). On Unix-like systems (OSX and Linux), use forward slashes (/).
- Use single quotation marks (') for all paths, especially if they contain spaces or special characters.
- Sub-directories are parsed by default.
- Pay attention to indentation. The configuration is written in YAML (for now), so wrong indentation will cause problems.
- Do not mix OTF degradations with the paired dataloader or the default model. OTF should always be used with model otf and dataloader otf.
This section describes the relevant launch options.
The -opt
argument specifies the configuration file. It must be in YAML format.
python train.py -opt config.yml
This option is required.
The --auto_resume
argument will resume training if the name
option in your config file corresponds to an existing folder in /experiments
and if a model is found under the /experiments/modelname/models/
folder.
python train.py --auto_resume -opt config.yml
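As a rough illustration of the two conditions described above (the experiment folder exists and a model is found under its models/ folder), here is a minimal Python sketch. The helper name can_auto_resume and the experiments_root argument are hypothetical, and this is not neosr's actual implementation:

```python
from pathlib import Path


def can_auto_resume(experiments_root: str, name: str) -> bool:
    """Hypothetical check: does <experiments_root>/<name>/models/ hold any file?"""
    models_dir = Path(experiments_root) / name / "models"
    # Condition 1: the folder for this `name` exists.
    # Condition 2: at least one file (a saved model) is inside it.
    return models_dir.is_dir() and any(p.is_file() for p in models_dir.iterdir())
```

If the check fails, training would simply start from scratch under the same name.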
The --launcher
argument specifies the job launcher. This is only useful if you're doing distributed training (multiple GPUs).
Possible options are none
, pytorch
and slurm
. See the distributed training section for more information.
python train.py --launcher slurm -opt config.yml
The name
option sets the folder name where your training files will be stored. It's a convention to use a prefix based on the scale factor of the model you're training:
name: 4x_mymodel
The model_type
option specifies which model should be used. If you are training with a paired
or single
dataset, you should set it to default
:
model_type: default
If you want to use on-the-fly degradations, set it to otf
instead.
The scale
option sets the scale ratio of your generator. It can be 1x, 2x or 4x:
scale: 4
The num_gpu option sets the number of GPUs. You don't have to specify this option unless you're doing distributed training.
Default is auto, which gets the number of GPUs using torch.cuda.device_count.
num_gpu: auto
The use_amp
option enables Automatic Mixed Precision to speed up training. If you are using a GPU with tensor cores (Nvidia Turing or higher), using AMP is recommended. The bfloat16
option sets the dtype to BFloat16 instead of the default float16. Using this is recommended if your GPU has support (Nvidia Ampere or higher).
Important
Beware that AMP can cause loss of precision and may cause instabilities. Not all functions are implemented in bfloat16 yet, so if you hit an error while using it, please update to PyTorch Nightly and report it in the issues.
# Turing or above
use_amp: true
# Ampere or above
bfloat16: true
The fast_matmul option uses TF32 or double bfloat16 (determined by your hardware and the heuristics of set_float32_matmul_precision) in all float32 precision operations. In practice, it increases performance without affecting final results in most cases.
fast_matmul: true
The compile
option is experimental. This option enables pytorch's new torch.compile()
, which can speed up training.
compile: true
Note
For now, only Linux supports torch.compile, because Triton is not officially supported on Windows yet.
The manual_seed option enables deterministic training. You should use it if your goal is to make precise tests/comparisons. It is recommended that you use manual seed values >= 1024.
Important
If you are not running experiments and just want to train a real-world model, leave this option commented out, otherwise training performance will decrease significantly.
manual_seed: 1024
Note: Using high seeds is recommended because of torch.Generator
.
This section describes the options within:
datasets:
  train:
The type option specifies the type of dataset loader.
Possible options are paired, single and otf. The single type should only be used for inference, not training. The paired option is the default and only works if you have LQ images set in dataroot_lq. The otf type is meant to be used together with model_type: otf.
datasets:
  train:
    type: paired # For paired datasets
    #type: otf # For on-the-fly degradations
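The pairing rule between model_type and the dataset type can be checked before launching a run. The sketch below assumes the YAML file has already been loaded into a plain Python dict. The function name check_otf_consistency is hypothetical and not part of neosr:

```python
def check_otf_consistency(config: dict) -> None:
    """Raise ValueError if OTF is mixed with the paired/default combination."""
    model_type = config.get("model_type", "default")
    dataset_type = config["datasets"]["train"]["type"]
    # OTF must be set on both sides or on neither.
    if (model_type == "otf") != (dataset_type == "otf"):
        raise ValueError(
            f"model_type '{model_type}' does not match dataset type '{dataset_type}': "
            "use otf with otf, or default with paired"
        )
```

For example, a config with model_type: otf and type: paired would be rejected, while default with paired passes silently.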
The dataroot_gt
and dataroot_lq
options are the folder paths to your dataset. This can be either normal images or an LMDB database.
The "gt" (ground-truth) are the ideal images, the ones you want your model to transform your images to. The "lq" (low quality) images are the degraded ones.
For a folder with images, just include the path:
dataroot_gt: 'C:\My_Images_Folder\gt\'
dataroot_lq: 'C:\My_Images_Folder\lq\'
If you're using LMDB, both paths should end with the .lmdb
suffix:
dataroot_gt: '/home/user/dataset_gt.lmdb'
dataroot_lq: '/home/user/dataset_lq.lmdb'
The meta_info option (optional) points to a text file describing the image file names. It is recommended, to avoid unexpected training aborts due to dataset errors such as file name mismatches.
To generate the meta_info file, you can use the script generate_meta_info.py.
Note
If you use create_lmdb.py
to convert all your images into an LMDB database, the meta_info
option is not necessary, as the script will automatically generate and link it.
meta_info: 'C:\meta_info.txt'
The io_backend type option has two possible values: disk or lmdb. If you're using a folder with images, the config should be:
datasets:
  train:
    io_backend:
      type: disk
Or if you're using LMDB:
datasets:
  train:
    io_backend:
      type: lmdb
The gt_size option is one of the most important options you have to change. It sets the size that each image will be cropped to before being sent to the network. A random area of each image is cropped at every new epoch.
Notes on gt_size:
- gt_size is the crop size applied to your GT. The LQ pair will be cropped to gt_size divided by your scale ratio. For example, if you set gt_size: 128 and scale: 4, your GT crop will be 128px and your LQ crop will be 32px. This is important, because if you use a different scale you have to adapt your gt_size, otherwise you might run out of VRAM due to your LQ crop being bigger than before. For example, using gt_size: 128 with scale: 1 makes your LQ crop the same size as your GT, which consumes a great amount of VRAM.
- Commonly used values for gt_size are:
  - For a 4x model: 128, 192, 256
  - For a 2x model: 64, 96, 128
  - For a 1x model: 32, 48 and 64
- Depending on the arch you're using, you may encounter tensor size mismatches and other problems with some gt_size values. In general, multiples of 8 or 16 should work on most networks.
- For transformers, your gt_size must be divisible by the window size. Standard values for window size are 8, 12, 16, 24 and 32.
- Increasing gt_size will lead to better end results (better model restoration accuracy), but VRAM usage will increase quadratically.
gt_size: 128
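The arithmetic in the notes above can be sketched in a few lines of Python. The helper names are hypothetical, and the quadratic estimate only counts pixels per crop as a stand-in for VRAM. Actual memory use depends on the arch:

```python
def lq_crop_size(gt_size: int, scale: int) -> int:
    """LQ crop edge in pixels: gt_size divided by the scale ratio."""
    if gt_size % scale != 0:
        raise ValueError("gt_size should be divisible by scale")
    return gt_size // scale


def fits_window(gt_size: int, window_size: int) -> bool:
    """True if gt_size is divisible by a transformer's window size."""
    return gt_size % window_size == 0


def relative_pixels(gt_size: int, baseline: int = 128) -> float:
    """Pixel count of a crop relative to a baseline crop (grows quadratically)."""
    return (gt_size / baseline) ** 2
```

With gt_size 128 and scale 4 the LQ crop is 32px. Doubling gt_size from 128 to 256 gives 4 times as many pixels per crop, and 128 fits a window size of 16 but not 12.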
The batch_size
option specifies the number of images to feed the network in each iteration.
Notes on batch_size:
- Large batches have a normalizing effect, i.e. training becomes more stable.
- Research shows that a larger batch size not only stabilizes training, but also makes the network learn faster. It may also improve the accuracy of the final restoration, although this depends on the optimizer you're using.
- Common batch_size values are 4, 8 and 16. Anything higher than 64 can be considered "high batch" (in research).
- batch_size sets the batch size per GPU.
The accumulate option specifies the number of batches to be accumulated, also known as Gradient Accumulation.
Using this option effectively trades training speed for lower VRAM usage. Your batch_size is multiplied by the accumulate value to give the effective batch size. For example: if batch_size is 2 and accumulate is 8, your effective batch will be 16. Default: 1.
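The effective batch can be computed as below. The num_gpu factor is an assumption on my part, based on batch_size being a per-GPU value. The function name is hypothetical:

```python
def effective_batch(batch_size: int, accumulate: int = 1, num_gpu: int = 1) -> int:
    """Effective batch: per-GPU batch times accumulation steps times GPU count."""
    # num_gpu is assumed to multiply the batch because batch_size is set per GPU.
    return batch_size * accumulate * num_gpu
```

Calling effective_batch(2, 8) reproduces the example from the text and returns 16.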
The color option converts all dataset images to grayscale. This is only useful if you're training a monochrome model, and it should not be in your configuration file at all unless you want your model to generate monochrome images.
Note: you need to change the network options to match the number of channels of your dataset. If you use color: y, the number of input and output channels of your network should be set to 1.
color: y
The use_hflip and use_rot options are augmentations. They flip and rotate images during training to increase variety. These are standard basic augmentations that have been shown to improve models.
use_hflip: true
use_rot: true
The augmentation and aug_prob options set which augmentations to use and their probabilities, respectively. Currently supported are MixUp, CutMix, ResizeMix and CutBlur. Both options are specified as lists. Each probability value corresponds to the same position in the augmentation list. For example:
datasets:
  train:
    augmentation: ['none', 'mixup', 'cutmix', 'resizemix', 'cutblur']
    aug_prob: [0, 0.5, 0.5, 0.7, 0.7]
The configuration above will run all 4 augmentations, giving probability 0 to none (meaning some augmentation will always be applied), 0.5 to MixUp and CutMix (50% chance of applying), and 0.7 to ResizeMix and CutBlur (70% chance). Make sure the number of probability values matches the number of augmentation types.
Note
CutBlur
is meant to be applied to real-world SR. If applied to bicubic-only it may cause undesired effects.
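Since the two lists are matched by position, a quick sanity check before training can catch a length mismatch. This is a minimal sketch with a hypothetical helper name. It only validates and pairs the lists, and does not reproduce how neosr samples augmentations:

```python
def pair_augmentations(augmentation: list, aug_prob: list) -> dict:
    """Map each augmentation name to the probability at the same list position."""
    if len(augmentation) != len(aug_prob):
        raise ValueError(
            f"{len(augmentation)} augmentation types but {len(aug_prob)} probabilities"
        )
    for prob in aug_prob:
        if not 0 <= prob <= 1:
            raise ValueError(f"probability {prob} is outside [0, 1]")
    return dict(zip(augmentation, aug_prob))
```

For the example lists above, the result maps none to 0, mixup and cutmix to 0.5, and resizemix and cutblur to 0.7.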
The num_worker_per_gpu option is the number of worker processes used by the PyTorch DataLoader.
num_worker_per_gpu: 6
The dataset_enlarge_ratio option is used to artificially increase the size of the dataset. If your dataset is too small, training will reach the end of an epoch too fast, causing slowdowns. Using this option virtually multiplies the dataset by N, so epochs take longer to complete.
dataset_enlarge_ratio: 10
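To get a feel for the effect, the sketch below estimates how many iterations one epoch takes. The formula is an assumption (dataset length times the ratio, divided by the total batch across GPUs, rounded up), not something taken from neosr's code, and the function name is hypothetical:

```python
import math


def iters_per_epoch(num_images: int, batch_size: int,
                    dataset_enlarge_ratio: int = 1, num_gpu: int = 1) -> int:
    """Estimated iterations per epoch under the assumed formula."""
    virtual_size = num_images * dataset_enlarge_ratio  # dataset repeated N times
    return math.ceil(virtual_size / (batch_size * num_gpu))
```

With 500 images and a batch of 8, an epoch would end after 63 iterations. Setting the ratio to 10 stretches that to 625.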
Warning
By default, validation doesn't tile the inputs. This means that if your validation images have a large resolution, you might run out of VRAM while validation is running. For this reason, it is recommended that you tile all your validation images to smaller resolutions (such as 256x256 or 512x512) before starting training. Alternatively, use the tile option as mentioned below.
The validation options, when enabled, will automatically run your model on a folder of images every time the val_freq iter is reached.
For example:
datasets:
  train:
  val:
    name: any_name
    type: single
    dataroot_lq: '/folder/path'
    io_backend:
      type: disk
val:
  val_freq: 1000
  save_img: true
The above configuration will perform inference on the dataroot_lq
folder whenever it reaches 1000 iterations (val_freq: 1000
).
Alternatively, you can use a paired validation set (both GT and LQ) and calculate metrics such as PSNR and SSIM:
datasets:
  train:
  val:
    name: any_name
    type: paired
    dataroot_gt: '/folder/path/gt/'
    dataroot_lq: '/folder/path/lq/'
    io_backend:
      type: disk
val:
  val_freq: 1000
  save_img: true
  tile: 200
  metrics:
    psnr:
      type: calculate_psnr
    ssim:
      type: calculate_ssim
    dists:
      type: calculate_dists
      better: lower
The tile option sets the number of tiles each image will be cut into during validation. This might prevent out-of-memory errors.
Validation results are saved in experiments/model_name/visualization/
. The metric value can be seen on the training log file, and/or with tensorboard
or wandb
(see Logger options below).
The path
options describe the path for pretrained models or resume state.
# Generator Pretrain
pretrain_network_g: '/path/to/pretrain.pth'
# Discriminator Pretrain
pretrain_network_d: '/path/to/pretrain.pth'
If you want to use a pretrain that has a different upscale ratio and/or you want to load a pretrain that was trained on a slightly different version of the arch, you can use the following option:
strict_load_g: false
Note
Unless you have a very specific need, do not change the configuration above.
If you have a .state
file that you want to load, comment out the pretrain_network_*
option and use resume_state
instead:
resume_state: '/path/to/pretrain.state'
These options describe which network architecture should be used. For a list of supported architectures, see the neosr readme.md
.
Unless the template files have some network parameter explicitly commented, all network parameters are set to the defaults from their research papers. This means that you don't need to manually type any parameters, just use the network names. For example:
network_g:
  type: atd_light
network_d:
  type: patchgan
The above option will train the ATD-light generator with the PatchGAN discriminator.
Important
Some networks have a parameter to specify the upscaling factor. These should be set to the same value as your scale
option.
The name of this parameter varies for each arch (upsampling, upscale, etc.), see the arch-specific options. By default, it is always set to 4, so if you're training a 2x model make sure to set this parameter to 2.
The print_network option is for logging only. If set to true, it will print the whole network in the terminal and save it to your training log file.
print_network: false
These options describe the main training options, such as optimizers and losses.
The optim_ options set the optimizers for the generator (optim_g) and the discriminator (optim_d), and their options. For the supported optimizers, see the readme. For their respective options, see the PyTorch documentation and pytorch-optimizer.
train:
  optim_g:
    type: adamw
    lr: !!float 1e-4
    weight_decay: 0
    betas: [0.9, 0.99]
    fused: true
  optim_d:
    type: adamw
    lr: !!float 1e-4
    weight_decay: 0
    betas: [0.9, 0.99]
    fused: true
The above option will set AdamW to a learning rate of 1e-4 (scientific notation).
Note
The fused: true
option can only be used with Adam and AdamW and is experimental. Some networks may not work properly when set to true.
This option sets the learning rate scheduler. Supported types are MultiStepLR
and CosineAnnealing
.
train:
  scheduler:
    type: multisteplr
    milestones: [60000, 120000]
    gamma: 0.5
The above option sets the MultiStepLR scheduler to multiply the learning rate by gamma: 0.5 at iter counts of 60k and 120k.
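To see what learning rate this schedule yields at a given iteration, here is a small closed-form sketch in plain Python. It follows the usual MultiStepLR rule (multiply by gamma once per milestone passed). The function name is hypothetical:

```python
from bisect import bisect_right


def multistep_lr(base_lr: float, milestones: list, gamma: float, iteration: int) -> float:
    """Learning rate after decaying by gamma at every milestone already reached."""
    passed = bisect_right(sorted(milestones), iteration)  # milestones <= iteration
    return base_lr * gamma ** passed
```

Starting from 1e-4 with the values above, the rate stays at 1e-4 until iteration 60k, drops to 5e-5 there, and drops again to 2.5e-5 at 120k.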
Example using CosineAnnealing:
train:
  scheduler:
    type: cosineannealing
    T_max: 350000
    eta_min: !!float 4e-5
The setting above uses the CosineAnnealing scheduler, reducing the learning rate to 4e-5 when it reaches 350k iters.
Note
For more information, see the PyTorch documentation.
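The shape of the cosine schedule can be previewed with the standard closed-form expression. This sketch assumes a starting learning rate of 1e-4 in the usage below, and holding the rate at eta_min after T_max is my own simplification for illustration. The function name is hypothetical:

```python
import math


def cosine_annealing_lr(base_lr: float, eta_min: float, t_max: int, iteration: int) -> float:
    """Cosine decay from base_lr at iteration 0 down to eta_min at t_max."""
    t = min(iteration, t_max)  # simplification: stay at eta_min once t_max is reached
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))
```

With base_lr 1e-4, eta_min 4e-5 and T_max 350000, the rate is about 7e-5 halfway through (iteration 175k) and reaches 4e-5 at 350k.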
The warmup_iter option linearly ramps up the learning rate over the specified iter count. For example:
train:
  warmup_iter: 10000
If you start training with a learning rate of 1e-4 using the above option, the learning rate will start from 0 and increase linearly until it reaches 1e-4 at 10k iterations.
This technique is used to reduce overfitting when fine-tuning models. The reference value is 2% of the total iters you want to train your model for. If unsure, use a value between 3200 and 10000. If -1 is specified, warmup is disabled.
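The linear ramp and the 2% rule of thumb can be sketched as follows. Both helper names are hypothetical, and the code only illustrates the arithmetic described above, not neosr's internals:

```python
def warmup_lr(base_lr: float, iteration: int, warmup_iter: int) -> float:
    """Linearly scale the learning rate from 0 to base_lr over warmup_iter iterations."""
    if warmup_iter <= 0 or iteration >= warmup_iter:
        return base_lr  # warmup disabled (-1) or already finished
    return base_lr * iteration / warmup_iter


def suggested_warmup(total_iter: int, fraction: float = 0.02) -> int:
    """Reference warmup length: 2% of the total iterations."""
    return round(total_iter * fraction)
```

For a 500k iter run, suggested_warmup(500000) gives 10000, and warmup_lr(1e-4, 5000, 10000) is half of the target rate.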
The grad_clip option controls Gradient Clipping. It is enabled by default, but it can be disabled by setting it to false:
train:
  grad_clip: false
Gradient Clipping provides more stable training and allows for training with higher learning rates.
The total_iter option sets the total number of iterations for training. When the total_iter value is reached, the training script saves the last models and stops.
total_iter: 500000 # end of training will be 500k iter
This option defines when to start and stop the discriminator.
The net_d_init_iters
sets when the discriminator is enabled:
net_d_init_iters: 80000
The above option will start the discriminator at 80k iters.
The net_d_iter
sets the total iter count for the discriminator.
net_d_iter: 500000
The above option will stop the discriminator at 500k iter count.
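Read together, the two options describe a window of iterations in which the discriminator runs. The sketch below encodes that reading. Whether each boundary is inclusive is an assumption, and the function name is hypothetical:

```python
def discriminator_active(iteration: int, net_d_init_iters: int, net_d_iter: int) -> bool:
    """True while the discriminator is enabled, under the assumed boundaries."""
    # Assumed: enabled from net_d_init_iters (inclusive) up to net_d_iter (exclusive).
    return net_d_init_iters <= iteration < net_d_iter
```

With the values above, the discriminator would be off at iteration 79999, on from 80000, and off again at 500000.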
For options on all losses, please read the dedicated wiki page.
These options describe the logger configuration.
This sets the terminal and log file printing of training information.
logger:
  print_freq: 100
The above option will print training information every 100 iterations.
The save_checkpoint_freq option sets how often the model files and the state file are saved.
logger:
  save_checkpoint_freq: 1000
The above option will save models and state every 1k iterations.
The use_tb_logger option enables tensorboard. A folder will be created inside experiments/ containing the files needed to initialize tensorboard.
logger:
  use_tb_logger: true
For details on using tensorboard, see the documentation.
Alternatively, you can use wandb
by using the following option:
logger:
  use_tb_logger: true
  wandb:
    project: "experiments/tb_logger/project/"
    resume_id: 1
The option use_tb_logger: true
is required to use wandb
.
These options describe the distributed training configuration.
This option specifies the backend to use for distributed training.
dist_params:
  backend: nccl
  port: 29500
The above option will set up distributed training using the nvidia nccl
library on port 29500.
You can also launch training with slurm
by passing a command line argument:
python train.py --launcher slurm -opt options.yml