The following takes Ubuntu 18.04 as an example to describe how to install MindSpore by pip in a Linux system with a GPU environment. If you want to install it manually or need more information, see the official documentation.
Before using the automatic installation script, make sure that the NVIDIA GPU driver is correctly installed on the system. The minimum required GPU driver version for CUDA 10.1 is 418.39; for CUDA 11.1 it is 450.80.02. Execute the following command to check the driver version.
nvidia-smi
If the GPU driver is not installed, run the following command to install it.
sudo apt-get update
sudo apt-get install ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
After the installation is complete, please reboot your system.
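After the reboot, you can run nvidia-smi again to confirm that the driver is loaded and that its version meets the minimum requirement for your CUDA version.
nvidia-smi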
Root permission is required because the automatic installation script needs to change the software source configuration and install dependencies via APT. Run the following commands to obtain and run the automatic installation script. The automatic installation script only supports installing MindSpore>=1.6.0.
wget https://gitee.com/mindspore/mindspore/raw/r2.0.0-alpha/scripts/install/ubuntu-gpu-pip.sh
# install MindSpore 2.0.0-alpha, Python 3.7 and CUDA 11.1
MINDSPORE_VERSION=2.0.0a0 bash -i ./ubuntu-gpu-pip.sh
# to specify the Python and MindSpore versions, taking Python 3.9 and MindSpore 1.6.0 as examples, use the following manner
# PYTHON_VERSION=3.9 CUDA_VERSION=10.1 MINDSPORE_VERSION=1.6.0 bash -i ./ubuntu-gpu-pip.sh
This script performs the following operations:
- Change the software source configuration to a HUAWEI CLOUD source.
- Install the dependencies required by MindSpore, such as GCC.
- Install Python3 and pip3 via APT and set them as default.
- Download and install CUDA and cuDNN.
- Install MindSpore GPU by pip.
- Install Open MPI if OPENMPI is set to on.
After the automatic installation script is executed, you need to reopen the terminal window to make the environment variables take effect.
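To confirm that MindSpore was installed successfully, you can run the built-in installation check (available since MindSpore 1.6.0):
python -c "import mindspore; mindspore.run_check()"
If the installation succeeded, the command prints the installed MindSpore version and reports that MindSpore has been installed successfully.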
Currently, we provide the val.py script to evaluate pre-trained models on a complete dataset. You can use the following command to evaluate a pre-trained model.
python val.py [--config_path ${CONFIG_FILE_PATH}] [--save_metrics ${SAVE_METRICS_RESULTS_PATH}]
for example,
python val.py --config_path ./configs/basicvsr/val.yaml --save_metrics ./output/quality_metrics.json
Note: The path of the trained parameter file (.ckpt) needs to be set in the corresponding model configuration file; for the parameter settings, refer to learn about Configs. If you want to save the inference output images, set img_save_directory in the callback section of the yaml file to the path where the output images should be saved.
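For example, a minimal sketch of the relevant yaml entries (the img_save_directory value below is only a placeholder; point it to your own output directory):
model:
  load_path: "./basicvsr.ckpt"            # trained parameter file (.ckpt)
callback:
  img_save_directory: "./output/images"   # placeholder path for saved inference images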
Once you have set the parameters in the yaml configuration file, you can start the training task with the following commands. For details about the parameter settings, see learn about Configs.
- Train with single GPU
python train.py [--config_path ${CONFIG_FILE_PATH}]
- for example,
python train.py --config_path ./configs/basicvsr/train.yaml
- Distributed Training
mpirun [-n ${DEVICE_NUMBER}] python train.py [--config_path ${CONFIG_FILE_PATH}]
- for example,
mpirun -n 2 python train.py --config_path ./configs/basicvsr/train.yaml
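Distributed training relies on Open MPI, which the automatic installation script installs when OPENMPI is set to on. Before launching a distributed job, you can check that mpirun is available:
mpirun --version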
- Train on ModelArts Platform
To run training on the ModelArts cloud platform:
1. Log in to the ModelArts management console.
2. In the navigation pane, choose Training Management > Training Jobs. The training job list is displayed.
3. Click Create Training Job. Then, configure parameters.
mindediting provides a yaml file for each model to configure its parameters. The following is an example that explains how to configure the corresponding parameters.
- Parameter description
- context_mode: Use graph mode or pynative mode.
- Sample yaml file
system:
  context_mode: "graph_mode"
  ...
- Parse parameter setting
python train.py --config_path ./configs/basicvsr/train.yaml
- Corresponding code example
...
path_args, _ = parser.parse_known_args()
default, helper, choices = parse_yaml(path_args.config_path)
...

...
def train(cfg):
    init_env(cfg)
    ...

def init_env(cfg):
    ...
    mode = cfg.system.context_mode.lower()
    context_mode = None
    if mode == "pynative_mode":
        context_mode = context.PYNATIVE_MODE
    elif mode == "graph_mode":
        context_mode = context.GRAPH_MODE
    context.set_context(mode=context_mode,
                        device_target=cfg.system.device_target)
    ...
- Parameter description
- dataset_name: Dataset name.
- input_path: Path of the dataset files.
- train_annotation: Training sample list file.
- test_annotation: Testing sample list file.
- gt_subdir: Subdirectory of the ground-truth (high-resolution) images.
- lr_subdir: Subdirectory of the low-resolution images.
- batch_size: Number of images in each batch.
- epoch_size: Number of training epochs.
- num_frames: Number of frames.
- num_parallel_workers: Number of workers (threads) to process the dataset in parallel.
- Sample yaml file
dataset:
  dataset_name: "vimeo_super_resolution"
  input_path: "./vimeo_super_resolution"
  train_annotation: 'sep_trainlist.txt'
  test_annotation: 'sep_testlist.txt'
  gt_subdir: 'sequences'
  lr_subdir: 'BIx4'
  batch_size: 8
  epoch_size: 37
  num_frames: 7
  num_parallel_workers: 8
  ...
- Parse parameter setting
python train.py --config_path ./configs/basicvsr/train.yaml
- Corresponding code example
def train(cfg):
    ...
    rank_id = get_device_id()
    group_size = get_device_num()
    dataset_train = create_dataset(
        name=cfg.dataset.dataset_name, root=cfg.dataset.input_path, split='train',
        shuffle=True, nframes=cfg.dataset.num_frames, num_shards=group_size, shard_id=rank_id, **cfg.dataset.cfg_dict)
    train_operations, train_input_columns, train_output_columns = create_transform(model_name=cfg.model.name,
                                                                                   split='train',
                                                                                   pipeline=cfg.train_pipeline,
                                                                                   **cfg.dataset.cfg_dict)
    loader_train = create_loader(
        dataset=dataset_train,
        batch_size=cfg.dataset.batch_size,
        operations=train_operations,
        input_columns=train_input_columns,
        output_columns=train_output_columns,
        split='train',
    )
    ...
- Parameter description
- name: Model name.
- load_path: Initialize master model weights from this checkpoint.
- need_val: Whether to validate during training.
- spynet_load_path: Initialize spynet weights from this checkpoint.
- Sample yaml file
model:
  name: "basicvsr"
  load_path: "./basicvsr.ckpt"
  spynet_load_path: "./spynet_20210409-c6c1bd09.ckpt"
train_params:
  need_val: True
- Parse parameter setting
python train.py --config_path ./configs/basicvsr/train.yaml
- Corresponding code example
def train(cfg):
    ...
    # create model
    cfg.mode = 'train'
    net, eval_network = create_model_by_name(model_name=cfg.model.name, cfg=cfg)
    ...
- Parameter description
- name: Name of loss function, bce (BinaryCrossEntropy) or ce (CrossEntropy).
- reduction: Apply specific reduction method to the output: 'mean' or 'sum'. Default: 'mean'.
- weight: Class weight. Shape [C]. A rescaling weight applied to the loss of each batch element.
- Sample yaml file
loss:
  name: "CharbonnierLoss"
  reduction: 'mean'
  weight: 1.0
  ...
- Parse parameter setting
python train.py --config_path ./configs/basicvsr/train.yaml
- Corresponding code example
def train(cfg):
    ...
    loss = create_loss(loss_name=cfg.loss.name, **cfg.loss.cfg_dict)
    ...
- Parameter description
- name: Optimizer name.
- learning_rate: Float or lr scheduler. Fixed and dynamic learning rates are supported.
- weight_decay: Weight decay factor.
- Sample yaml file
optimizer:
  name: "Adam"
  learning_rate: 1.0e-3
  weight_decay: 0.0
  ...
- Parse parameter setting
python train.py --config_path ./configs/basicvsr/train.yaml
- Corresponding code example
def train(cfg):
    ...
    loss_scale = 1.0 if cfg.loss.amp_level == "O0" else cfg.loss.loss_scale
    optimizer = create_optimizer(params=optimizer_params, lr=learning_rate,
                                 opt=cfg.optimizer.name, loss_scale=loss_scale,
                                 **{'beta1': cfg.optimizer.beta1,
                                    'beta2': cfg.optimizer.beta2})
    ...
- Parameter description
- name: Name of the scheduler.
- base_lr: Learning rate value.
- min_lr: Lower lr bound for 'cosine_decay' schedulers.
- warmup_epochs: Epochs to warm up the LR, if the scheduler supports it.
- Sample yaml file
scheduler:
name: "cosine_annealing"
base_lr: 2.0e-4
min_lr: 1.0e-7
warmup_epochs: 0
...
- Parse parameter setting
python train.py --config_path ./configs/basicvsr/train.yaml
- Corresponding code example
def train(cfg):
...
optimizer_params, learning_rate = create_scheduler_by_name(model_name=cfg.model.name, cfg=cfg)
...
- Parameter description
- PSNR: Peak Signal-to-Noise Ratio
  - reduction: Calculation method.
  - crop_border: Cropped pixels at each edge of an image. These pixels are not involved in the PSNR calculation.
  - input_order: Whether the input order is 'HWC' or 'CHW'.
  - convert_to: Whether to convert the images to other color models. If None, the images are not altered.
- SSIM: Structural Similarity
  - reduction: Calculation method.
  - crop_border: Cropped pixels at each edge of an image. These pixels are not involved in the SSIM calculation.
  - input_order: Whether the input order is 'HWC' or 'CHW'.
  - convert_to: Whether to convert the images to other color models. If None, the images are not altered.
- Sample yaml file
metric:
  PSNR:
    reduction: 'avg'
    crop_border: 0
    input_order: 'CHW'
    convert_to: 'y'
    ...
  SSIM:
    reduction: 'avg'
    crop_border: 0
    input_order: 'CHW'
    convert_to: 'y'
    ...
- Parse parameter setting
python train.py --config_path ./configs/basicvsr/train.yaml
- Corresponding code example
def train(cfg):
    ...
    metrics = create_metrics(cfg.metric)
    ...
- Parameter description
- keep_checkpoint_max: Maximum number of checkpoint files that can be saved.
- save_epoch_frq: Interval (in epochs) at which checkpoints are saved.
- eval_frequency: Interval (in steps) at which checkpoints are evaluated.
- workers_num: Number of workers.
- print_frequency: Frequency at which training information is printed.
- ckpt_save_dir: Path to save checkpoints.
- Sample yaml file
callback:
  keep_checkpoint_max: 60
  save_epoch_frq: 50
  eval_frequency: 10
  workers_num: 2
  print_frequency: 10
  ckpt_save_dir: "./ckpt/basicvsr"
- Parse parameter setting
python train.py --config_path ./configs/basicvsr/train.yaml
- Corresponding code example
def train(cfg):
    ...
    config_ck = CheckpointConfig(save_checkpoint_steps=cfg.train_params.save_epoch_frq,
                                 keep_checkpoint_max=cfg.train_params.keep_checkpoint_max)
    ...
    if cfg.train_params.need_val:
        callbacks.append(EvalAsInValPyCallBack(cfg, eval_network or net))
    ...