[Sven Kreiss](https://www.svenkreiss.com/), 2020

# Training

See {doc}`datasets` for instructions about the datasets.

The exact training command that was used for a model is in the first
line of the training log file.
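
A minimal sketch for recovering that command, assuming only that the log is a plain text file. The function name `training_command` and the path in the usage comment are placeholders, not part of openpifpaf:

```sh
# Print the first line of a training log, which holds the training command.
# training_command is a hypothetical helper name used for illustration only.
training_command() {
  head -n 1 "$1"
}

# Usage (replace the placeholder with your own log file):
# training_command "outputs/<your-model>.pkl.log"
```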


## ShuffleNet

ShuffleNet models are trained without ImageNet pretraining:

```sh
time CUDA_VISIBLE_DEVICES=0,1 python3 -m openpifpaf.train \
  --lr=0.1 \
  --momentum=0.9 \
  --epochs=150 \
  --lr-warm-up-epochs=1 \
  --lr-decay 120 \
  --lr-decay-epochs=20 \
  --lr-decay-factor=0.1 \
  --batch-size=32 \
  --square-edge=385 \
  --lambdas 1 1 0.2   1 1 1 0.2 0.2    1 1 1 0.2 0.2 \
  --auto-tune-mtl \
  --weight-decay=1e-5 \
  --update-batchnorm-runningstatistics \
  --ema=0.01 \
  --basenet=shufflenetv2k16w \
  --headnets cif caf caf25

# for improved performance, take the epoch150 checkpoint and train with
# extended-scale and 10% orientation invariance:
time CUDA_VISIBLE_DEVICES=0,1 python3 -m openpifpaf.train \
  --lr=0.05 \
  --momentum=0.9 \
  --epochs=250 \
  --lr-warm-up-epochs=1 \
  --lr-decay 220 \
  --lr-decay-epochs=30 \
  --lr-decay-factor=0.01 \
  --batch-size=32 \
  --square-edge=385 \
  --lambdas 1 1 0.2   1 1 1 0.2 0.2    1 1 1 0.2 0.2 \
  --auto-tune-mtl \
  --weight-decay=1e-5 \
  --update-batchnorm-runningstatistics \
  --ema=0.01 \
  --checkpoint outputs/shufflenetv2k16w-200504-145520-cif-caf-caf25-d05e5520.pkl \
  --extended-scale \
  --orientation-invariant=0.1
```

You can refine an existing model with the `--checkpoint` option.

For large models, reduce the batch size and learning rate by the same factor:

```sh
time CUDA_VISIBLE_DEVICES=0,1 python3 -m openpifpaf.train \
  --lr=0.025 \
  --momentum=0.9 \
  --epochs=200 \
  --lr-warm-up-epochs=1 \
  --lr-decay 180 \
  --lr-decay-epochs=20 \
  --lr-decay-factor=0.01 \
  --batch-size=16 \
  --square-edge=385 \
  --lambdas 1 1 0.2   1 1 1 0.2 0.2    1 1 1 0.2 0.2 \
  --auto-tune-mtl \
  --weight-decay=1e-5 \
  --update-batchnorm-runningstatistics \
  --ema=0.01 \
  --checkpoint outputs/shufflenetv2k44w-200521-074105-cif-caf-caf25-a35c65dd.pkl \
  --extended-scale \
  --orientation-invariant=0.1
```
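
The command above halves both values relative to the `shufflenetv2k16w` refinement run (batch size 32 to 16, learning rate 0.05 to 0.025). As a rough sketch of that proportional rule, the hypothetical helper `scale_lr` below computes the learning rate for a new batch size; it is an illustration of the arithmetic, not an openpifpaf tool:

```sh
# Scale a learning rate by the same factor as the batch size.
# scale_lr is a hypothetical helper name used for illustration only.
# Arguments: reference learning rate, reference batch size, new batch size.
scale_lr() {
  awk -v lr="$1" -v ref="$2" -v new="$3" \
    'BEGIN { printf "%g\n", lr * new / ref }'
}

scale_lr 0.05 32 16   # prints 0.025
```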

## ResNet

ResNet models are initialized with weights pre-trained on ImageNet.
As a result, their training behaves differently from ShuffleNet: results already look good at the beginning of training.

```sh
time CUDA_VISIBLE_DEVICES=0,1 python3 -m openpifpaf.train \
  --lr=0.05 \
  --momentum=0.9 \
  --epochs=150 \
  --lr-warm-up-epochs=1 \
  --lr-decay 120 \
  --lr-decay-epochs=20 \
  --lr-decay-factor=0.1 \
  --batch-size=16 \
  --square-edge=385 \
  --lambdas 1 1 0.2   1 1 1 0.2 0.2    1 1 1 0.2 0.2 \
  --auto-tune-mtl \
  --weight-decay=1e-5 \
  --update-batchnorm-runningstatistics \
  --ema=0.01 \
  --basenet=resnet50 \
  --headnets cif caf caf25
```

## Logs

To visualize logs:

```sh
python3 -m openpifpaf.logs \
  outputs/resnet50block5-pif-paf-edge401-190424-122009.pkl.log \
  outputs/resnet101block5-pif-paf-edge401-190412-151013.pkl.log \
  outputs/resnet152block5-pif-paf-edge401-190412-121848.pkl.log
```

To produce evaluation metrics for every fifth epoch, checking the output directory
for new checkpoints every 5 minutes:

```sh
while true; do \
  CUDA_VISIBLE_DEVICES=0 find outputs/ -name "shufflenetv2k16w-200504-145520-cif-caf-caf25.pkl.epoch??[05]" -exec \
    python3 -m openpifpaf.eval_coco --checkpoint {} --long-edge=641 --skip-existing \; \
  ; \
  sleep 300; \
done
```
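
The `find -name` argument is a shell-style pattern: `?` matches any single character and a bracket expression matches one character from a set. The sketch below shows how a pattern ending in `??[05]` selects every fifth epoch, assuming epoch numbers are zero-padded to three digits. The helper `is_fifth_epoch` and the file names are made up for illustration:

```sh
# Succeed if a checkpoint name ends in a three-character epoch suffix
# whose last character is 0 or 5.
# is_fifth_epoch is a hypothetical helper name used for illustration only.
is_fifth_epoch() {
  case "$1" in
    *.epoch??[05]) return 0 ;;
    *) return 1 ;;
  esac
}

# Made-up file names, not real checkpoints.
for name in model.pkl.epoch005 model.pkl.epoch012 model.pkl.epoch150; do
  if is_fifth_epoch "$name"; then
    echo "evaluate $name"
  else
    echo "skip $name"
  fi
done
# prints:
# evaluate model.pkl.epoch005
# skip model.pkl.epoch012
# evaluate model.pkl.epoch150
```

Inside a bracket expression a comma is a literal character, so `[0,5]` would also match a trailing comma; `[05]` is the tighter form.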