[Sven Kreiss](https://www.svenkreiss.com/), 2020

# Training

This section explains how to train your own models. You can skip it
if you only run pre-trained models on your own images with {doc}`predict_cli`.
Training requires a dataset. See {doc}`datasets` for instructions on a few popular datasets.

Training a model can take several days even with a good GPU. Existing models can often be refined instead of training from scratch.

A quick way to get started is with the training commands of the pre-trained models.
The exact training command that was used for a model is in the first
line of the training log file. Below are a few examples for various backbones.
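
To look up the command for a given model, it is enough to read the first line of its log file. The following is a minimal sketch, assuming the log is a plain UTF-8 text file. The helper name `training_command_line` is hypothetical and not part of openpifpaf, and the line is returned as-is without any parsing:

```python
import sys
from pathlib import Path


def training_command_line(log_path):
    """Return the first line of a training log without the trailing newline."""
    with Path(log_path).open(encoding="utf-8") as log_file:
        return log_file.readline().rstrip("\n")


if __name__ == "__main__":
    # Print the first line of every log file given on the command line.
    for path in sys.argv[1:]:
        print(training_command_line(path))
```

Saved under a name of your choice, the script can be called with one or more `.pkl.log` files from the `outputs/` directory as arguments.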

## ShuffleNet

ShuffleNet models are trained without ImageNet pretraining:

```sh
time CUDA_VISIBLE_DEVICES=0,1 python3 -m openpifpaf.train \
  --lr=0.0001 --momentum=0.98 --b-scale=10.0 \
  --epochs=150 \
  --lr-decay 130 140 \
  --lr-decay-epochs=10 \
  --batch-size=32 \
  --square-edge=385 \
  --weight-decay=1e-5 \
  --update-batchnorm-runningstatistics \
  --basenet=shufflenetv2k16w \
  --headnets cif caf caf25
```

For improved performance, take the epoch150 checkpoint and train with
extended-scale and 10% orientation invariance:

```sh
time CUDA_VISIBLE_DEVICES=0,1 python3 -m openpifpaf.train \
  --lr=0.00005 --momentum=0.98 --b-scale=10.0 \
  --epochs=250 \
  --lr-warm-up-start-epoch=150 \
  --lr-decay 220 240 \
  --lr-decay-epochs=10 \
  --batch-size=32 \
  --square-edge=385 \
  --weight-decay=1e-5 \
  --update-batchnorm-runningstatistics \
  --checkpoint outputs/shufflenetv2k30w-200811-232815-cif-caf-caf25-659c5af6.pkl \
  --extended-scale \
  --orientation-invariant=0.1
```

You can refine an existing model with the `--checkpoint` option.

For large models, reduce the batch size and learning rate by the same factor.
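
As a worked example of that rule, the sketch below divides both values by the same factor, starting from the `--batch-size=32` and `--lr=0.0001` used in the first ShuffleNet command. The helper `scale_for_large_model` and the factor of 2 are illustrative assumptions; the right factor depends on your model and GPU memory:

```python
def scale_for_large_model(batch_size, learning_rate, reduction_factor):
    """Divide batch size and learning rate by the same integer factor."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    if batch_size % reduction_factor != 0:
        raise ValueError("batch_size must be divisible by reduction_factor")
    return batch_size // reduction_factor, learning_rate / reduction_factor


# Starting point: --batch-size=32 --lr=0.0001, reduced by a factor of 2.
batch_size, learning_rate = scale_for_large_model(32, 0.0001, 2)
print(f"--batch-size={batch_size} --lr={learning_rate:g}")
```

The printed options can be substituted into the training command in place of the original values.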

## ResNet

ResNet models are initialized with weights pre-trained on ImageNet.
As a result, their training characteristics differ from ShuffleNet: their results already look good at the beginning of training.

## Logs

To visualize logs:

```sh
python3 -m openpifpaf.logs \
  outputs/resnet50block5-pif-paf-edge401-190424-122009.pkl.log \
  outputs/resnet101block5-pif-paf-edge401-190412-151013.pkl.log \
  outputs/resnet152block5-pif-paf-edge401-190412-121848.pkl.log
```

To produce evaluation metrics for every fifth epoch, checking the directory for new
checkpoints every five minutes:

```sh
while true; do \
  CUDA_VISIBLE_DEVICES=0 find outputs/ -name "shufflenetv2k16w-200504-145520-cif-caf-caf25.pkl.epoch??[05]" -exec \
    python3 -m openpifpaf.eval_coco --checkpoint {} --long-edge=641 --skip-existing \; \
  ; \
  sleep 300; \
done
```
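
The file name pattern in the loop above selects checkpoints whose three-digit epoch suffix ends in 0 or 5. The same selection can be sketched in Python, assuming checkpoint names end in `.epoch` followed by digits as in the pattern above. The function `checkpoints_every_n_epochs` and the `model.pkl` names are hypothetical and only serve as illustration:

```python
import re

# Matches a trailing ".epoch<digits>" suffix and captures the epoch number.
EPOCH_SUFFIX = re.compile(r"\.epoch(\d+)$")


def checkpoints_every_n_epochs(filenames, n=5):
    """Return the sorted names whose epoch number is a multiple of n."""
    selected = []
    for name in filenames:
        match = EPOCH_SUFFIX.search(name)
        if match and int(match.group(1)) % n == 0:
            selected.append(name)
    return sorted(selected)


# Hypothetical checkpoint names for illustration.
names = [
    "model.pkl.epoch004",
    "model.pkl.epoch005",
    "model.pkl.epoch010",
    "model.pkl",
]
print(checkpoints_every_n_epochs(names))
```

Names without an epoch suffix, such as the final `model.pkl`, are skipped.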