
Configuration Walkthrough


This document explains each configuration option for neosr. Templates can be used for convenience.

Initial Notes

  • Avoid special characters in both paths and filenames. Although parsing UTF-8 isn't a problem in most cases, it can still cause issues.

  • Prefer full paths over relative paths. Both can be used, but full paths avoid confusion.

  • Make sure to use slashes that match your system. On Windows, use backslashes (\). On Unix-like systems (OSX and Linux), use forward slashes (/).

  • Use single quotation marks (') around all paths, especially if they contain spaces or special characters (see the short illustration after this list).

  • Sub-directories are parsed by default.

  • Pay attention to indentation. The configuration is written in YAML (for now), so incorrect indentation will cause problems.

  • Do not mix OTF degradations with the paired dataloader and the default model. OTF should always be used with model_type: otf and dataset type: otf.
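
As a quick illustration of the notes above (the folder paths are just placeholders):

datasets:
    train:
        dataroot_gt: 'C:\My_Images_Folder\gt\'
        dataroot_lq: 'C:\My_Images_Folder\lq\'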

Launch Options

This section describes the relevant launch options.


-opt

The -opt argument specifies the configuration file. It must be in YAML format.

python train.py -opt config.yml

This option is required.


--auto_resume

The --auto_resume argument will resume training if the name option in your config file corresponds to an existing folder in /experiments and if a model is found under the /experiments/modelname/models/ folder.

python train.py --auto_resume -opt config.yml

--launcher

The --launcher argument specifies the job launcher. This is only useful if you're doing distributed training (multiple GPUs). Possible options are none, pytorch and slurm. See the distributed training section for more information.

python train.py --launcher slurm -opt config.yml

Header Options


name

The name option sets the folder name where your training files will be stored. It's a convention to use a prefix based on the scale factor of the model you're training:

name: 4x_mymodel

model_type

The model_type option specifies which model should be used. If you are training with a paired or single dataset, you should set it to default:

model_type: default

If you want to use on-the-fly degradations, set it to otf instead.
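
For example, when training with on-the-fly degradations (together with the otf dataset type described below):

model_type: otf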


scale

The scale option sets the scale ratio of your generator. It can be 1, 2 or 4:

scale: 4

num_gpu

The num_gpu option sets the number of GPUs. You don't have to specify this option unless you're doing distributed training. The default is 'auto', which detects the number of GPUs using torch.cuda.device_count().

num_gpu: auto

use_amp and bfloat16

The use_amp option enables Automatic Mixed Precision to speed up training. If you are using a GPU with tensor cores (Nvidia Turing or higher), using AMP is recommended. The bfloat16 option sets the dtype to BFloat16 instead of the default float16. Using this is recommended if your GPU has support (Nvidia Ampere or higher).

Important

Beware that AMP can cause loss of precision and may cause instabilities. Not all functions are implemented for bfloat16 yet, so if you encounter an error while using it, please update to PyTorch Nightly and report it on the issue tracker.

# Turing or above
use_amp: true

# Ampere or above
bfloat16: true

fast_matmul

This option enables TF32 or a bfloat16-based equivalent (determined by your hardware and the heuristics of torch.set_float32_matmul_precision) in all float32 precision operations. In practice, it increases performance without affecting final results in most cases.

fast_matmul: true

compile

The compile option is experimental. This option enables pytorch's new torch.compile(), which can speed up training.

compile: true

Note

For now, only Linux supports torch.compile, because Triton is not yet officially supported on Windows.


manual_seed

The manual_seed option enables deterministic training. You should use it if your goal is to make precise tests/comparisons. It is recommended that you use manual seed values >= 1024.

Important

If you are not running experiments and just want to train a real-world model, leave this option commented out, otherwise training performance will decrease significantly.

manual_seed: 1024

Note: Using high seeds is recommended because of torch.Generator.


Dataset options

This section describes the options within:

datasets:
    train:

(dataset) type

The type option specifies the type of dataset loader. Possible options are paired, single and otf. The single type should only be used for inference, not training. The paired option is the default one and will only work if you have LQ images set in dataroot_lq. The otf type is meant to be used together with model: otf.

datasets:
    train:
        type: paired # For paired datasets
        #type: otf # For on-the-fly degradations

dataroot_gt, dataroot_lq

The dataroot_gt and dataroot_lq options are the folder paths to your dataset. They can point to either normal images or an LMDB database. The "gt" (ground truth) images are the ideal targets, the ones you want your model to learn to produce. The "lq" (low quality) images are the degraded ones. For a folder with images, just include the path:

dataroot_gt: 'C:\My_Images_Folder\gt\'
dataroot_lq: 'C:\My_Images_Folder\lq\'

If you're using LMDB, both paths should end with the .lmdb suffix:

dataroot_gt: '/home/user/dataset_gt.lmdb'
dataroot_lq: '/home/user/dataset_lq.lmdb'

meta_info

The meta_info option (optional) points to a text file listing the image file names. It is optional, but recommended to avoid unexpected training aborts due to dataset errors such as file name mismatches.

To generate the meta_info, you can use the script generate_meta_info.py.

Note

If you use create_lmdb.py to convert all your images into an LMDB database, the meta_info option is not necessary, as the script will automatically generate and link it.

meta_info: 'C:\meta_info.txt'

io_backend

The io_backend type option has two possible values: disk or lmdb. If you're using a folder with images, the config should be:

datasets:
    train:
        io_backend:
            type: disk

Or if you're using LMDB:

datasets:
    train:
        io_backend:
            type: lmdb

gt_size

The gt_size option is one of the most important options you have to change. It sets the size to which each image will be cropped before being sent to the network. A random area of each image is cropped at every new epoch.

Notes on gt_size:

  • gt_size is the crop size being applied to your GT. The LQ pair will be cropped to gt_size divided by your scale ratio. For example, if you set gt_size: 128 and scale: 4 that means your GT will be 128px and your LQ will be 32px. This is important, because if you have a different scale you have to adapt your gt_size, otherwise you might run out of VRAM due to your LQ crop being bigger than before. For example, using gt_size: 128 with scale: 1 will lead to your LQ crop being the same as your GT, which means it will consume a great amount of VRAM.

  • Commonly used constant values for gt_size are:

    • For a 4x model: 128, 192, 256
    • For a 2x model: 64, 96, 128
    • For a 1x model: 32, 48 and 64

  • Depending on the arch you're using, you may encounter tensor size mismatches and other problems with some gt_size values. In general, multiples of 8 or 16 should work on most networks.

  • For transformers, your gt_size must be divisible by the window size. Standard values for window size are 8, 12, 16, 24 and 32.

  • Increasing gt_size will lead to better end results (better model restoration accuracy), but VRAM usage will increase quadratically.

gt_size: 128

batch_size

The batch_size option specifies the number of images to feed the network in each iteration.

Notes on batch_size:

  • Large batches have a normalizing effect, i.e. training becomes more stable.
  • Research shows that larger batch sizes not only stabilize training, but also make the network learn faster. They may also improve the accuracy of the final restoration, although this depends on the optimizer you're using.
  • Common batch_size values are: 4, 8 and 16. Anything higher than 64 can be considered "high batch" (in research).
  • batch_size sets the batch size per GPU.
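
For example (8 is just an illustrative value from the common range above):

batch_size: 8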

accumulate

The accumulate option specifies the number of batches to accumulate, also known as Gradient Accumulation. It effectively trades training speed for lower VRAM usage. Your batch_size is multiplied by the accumulate value to give the effective batch size. For example: if batch_size is 2 and accumulate is 8, your effective batch will be 16. Default: 1.
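
A minimal sketch of the example above (values are illustrative):

batch_size: 2
accumulate: 8 # effective batch = 2 x 8 = 16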


color: y

The color option converts all dataset images to grayscale. This is only useful if you're training a monochrome model, and it should not be in your configuration file at all unless you want your model to generate monochrome images. Note: you need to change the network options to match the number of channels of your dataset. If you use color: y, the number of input and output channels of your network should be set to 1.

color: y

use_hflip, use_rot

The use_hflip and use_rot options are augmentations. They flip and rotate images during training to increase variety. These are standard, basic augmentations that have been shown to improve models.

use_hflip: true
use_rot: true

augmentations, aug_prob

The augmentations and aug_prob options specify which augmentations to use and their probabilities, respectively. Currently supported are MixUp, CutMix, ResizeMix and CutBlur. Both options are specified as lists. Each probability value corresponds to the augmentation at the same position in the list. For example:

datasets:
  train:
    augmentation: ['none', 'mixup', 'cutmix', 'resizemix', 'cutblur']
    aug_prob: [0, 0.5, 0.5, 0.7, 0.7]

The configuration above will run all four augmentations, giving probability 0 to none (meaning some augmentation will always be applied), 0.5 to MixUp and CutMix (50% chance of applying), and 0.7 to ResizeMix and CutBlur (70% chance). Make sure the number of probability values matches the number of augmentation types.

Note

CutBlur is meant to be applied to real-world SR. If applied to bicubic-only datasets, it may cause undesired effects.


num_worker_per_gpu

The num_worker_per_gpu option sets the number of worker processes used by the PyTorch DataLoader.

num_worker_per_gpu: 6

dataset_enlarge_ratio

The dataset_enlarge_ratio option is used to artificially increase the size of the dataset. If your dataset is too small, training will reach an epoch too fast, causing slowdowns. Using this option will virtually multiply the dataset by N times, so epochs will take longer to reach.

dataset_enlarge_ratio: 10

Validation

Warning

By default, validation doesn't tile the inputs. This means that if your validation images have a large resolution, you might run out of VRAM while validation is running. For this reason, it is recommended that you tile all your validation images to smaller resolutions (such as 256x256 or 512x512) before starting training. Alternatively, use the tile option as mentioned below.

The validation options, when enabled, will automatically run your model on a folder of images every time the val_freq iter count is reached. For example:

datasets:
    train:
        # ... training dataset options ...
    val:
        name: any_name
        type: single
        dataroot_lq: '/folder/path'
        io_backend:
            type: disk

val:
    val_freq: 1000
    save_img: true

The above configuration will perform inference on the dataroot_lq folder every 1000 iterations (val_freq: 1000). Alternatively, you can use a paired validation set (both GT and LQ) and calculate metrics such as PSNR and SSIM:

datasets:
    train:
        # ... training dataset options ...
    val:
        name: any_name
        type: paired
        dataroot_gt: '/folder/path/gt/'
        dataroot_lq: '/folder/path/lq/'
        io_backend:
            type: disk

val:
    val_freq: 1000
    save_img: true
    tile: 200
    metrics:
        psnr:
            type: calculate_psnr
        ssim:
            type: calculate_ssim
        dists:
            type: calculate_dists
            better: lower

The tile option sets the number of tiles each image will be cut into during validation. This can prevent out-of-memory errors. Validation results are saved in experiments/model_name/visualization/. The metric values can be seen in the training log file and/or with tensorboard or wandb (see Logger options below).


path

The path options describe the path for pretrained models or resume state.

# Generator Pretrain
pretrain_network_g: '/path/to/pretrain.pth'
# Discriminator Pretrain
pretrain_network_d: '/path/to/pretrain.pth'

If you want to use a pretrain that has a different upscale ratio, and/or you want to load a pretrain that was trained on a slightly different version of the arch, you can use the following option:

strict_load_g: false

Note

Unless you have a very specific need, do not change the configuration above.

If you have a .state file that you want to load, comment out the pretrain_network_* option and use resume_state instead:

resume_state: '/path/to/pretrain.state'

network_g and network_d

These options describe which network architecture should be used. For a list of supported architectures, see the neosr readme.md. Unless the template file has some network parameter explicitly commented, all network parameters are set to defaults based on their research papers. This means you don't need to manually type any parameters; just use the network name. For example:

network_g:
    type: atd_light

network_d:
    type: patchgan

The above option will train the ATD-light generator with the PatchGAN discriminator.

Important

Some networks have a parameter to specify the upscaling factor. It should be set to the same value as your scale option. The name of this parameter varies for each arch (upsampling, upscale, etc.), see arch-specific options. By default, it will be set to 4, so if you're training a 2x model, make sure this parameter matches.


print_network

This option is for logging only. If set to true, it will print the whole network in the terminal and save it to your training log file.

print_network: false

Train

These options describe the main training options, such as optimizers and losses.


optim_g and optim_d

The optim_ options set the optimizers for the generator and discriminator, and their options. For the supported optimizers, see the readme. For their respective options, see pytorch documentation and pytorch-optimizer.

train:
    optim_g:
        type: adamw
        lr: !!float 1e-4
        weight_decay: 0
        betas: [0.9, 0.99]
        fused: true
    optim_d:
        type: adamw
        lr: !!float 1e-4
        weight_decay: 0
        betas: [0.9, 0.99]
        fused: true

The above option will set AdamW to a learning rate of 1e-4 (scientific notation).

Note

The fused: true option can only be used with Adam and AdamW and is experimental. Some networks may not work properly when set to true.


scheduler

This option sets the learning rate scheduler. Supported types are MultiStepLR and CosineAnnealing.

train:
    scheduler:
        type: multisteplr
        milestones: [60000, 120000]
        gamma: 0.5

The above option sets the MultiStepLR scheduler to reduce the learning rate by gamma: 0.5 at iter counts of 60k and 120k. Example using CosineAnnealing:

train:
    scheduler:
        type: cosineannealing
        T_max: 350000
        eta_min: !!float 4e-5

The setting above sets the CosineAnnealing scheduler, reducing the learning rate to 4e-5 by the time it reaches 350k iters.

Note

For more information, see the pytorch documentation


warmup_iter

This option linearly ramps up the learning rate for the specified iter count. For example:

train:
    warmup_iter: 10000

If you start training with a learning rate of 1e-4 using the above option, the learning rate will start from 0 and increase to 1e-4 (linearly) when it reaches 10k iterations.

This technique is used to reduce overfitting when fine-tuning models. The reference value is 2% of the total iters you want to train your model for (for example, 10000 for a 500k-iter run). If unsure, use a value between 3200 and 10000. If -1 is specified, warmup is disabled.


grad_clip

This option controls Gradient Clipping. By default, grad_clip is enabled, however it can be disabled by passing false:

train:
    grad_clip: false

Gradient Clipping provides more stable training and allows for training with higher learning rates.


total_iter

Sets the total number of iterations for training. When the total_iter value is reached, training stops and the last models are saved.

total_iter: 500000 # end of training will be 500k iter

net_d_iters and net_d_init_iters

These options define when to start and stop the discriminator.

The net_d_init_iters option sets when the discriminator is enabled:

net_d_init_iters: 80000

The above option will start the discriminator at 80k iters.

The net_d_iters option sets the total iter count for the discriminator.

net_d_iters: 500000

The above option will stop the discriminator at 500k iter count.


Losses

For options on all losses, please read the dedicated wiki page.


Logger

These options describe the logger configuration.


print_freq

This option sets how often training information is printed to the terminal and to the log file.

logger:
    print_freq: 100

The above option will print training information at every 100 iterations.


save_checkpoint_freq

This option sets the frequency of saving model and state files.

logger:
    save_checkpoint_freq: 1000

The above option will save models and state at every 1k iter count.


use_tb_logger and wandb

This option enables tensorboard logging. A folder will be created inside experiments/ containing the files needed to initialize tensorboard.

logger:
    use_tb_logger: true

For details on using tensorboard, see the documentation.

Alternatively, you can use wandb by using the following option:

logger:
    use_tb_logger: true
    wandb:
        project: "experiments/tb_logger/project/"
        resume_id: 1

The option use_tb_logger: true is required to use wandb.


Distributed Training

These options describe the distributed training configuration.


backend

This option specifies the backend to use for distributed training.

dist_params:
    backend: nccl
    port: 29500

The above option sets up distributed training using the Nvidia NCCL library on port 29500. You can also launch training with slurm by passing a command line argument:

python train.py --launcher slurm -opt options.yml