ResNet50 v1.5 training

This document has instructions for running ResNet50 v1.5 training using Intel-optimized TensorFlow.

Datasets

These ResNet50 v1.5 examples use the ImageNet dataset. Download and preprocess the ImageNet dataset using the instructions here. After running the conversion script, you should have a directory containing the ImageNet dataset in TFRecord format.

Set the `DATASET_DIR` environment variable to point to this directory when running ResNet50 v1.5.
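
Before starting a long run, it can be worth confirming that `DATASET_DIR` really points at a populated directory. The following is a minimal sketch; `check_dataset_dir` is a hypothetical helper and not part of the Model Zoo, and it only checks that the directory exists and is non-empty, since the exact file names depend on the conversion script.

```bash
# Sketch: sanity-check the dataset directory before launching training.
# check_dataset_dir is a hypothetical helper, not part of the Model Zoo.
check_dataset_dir() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "DATASET_DIR is not set" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "DATASET_DIR is not a directory: $dir" >&2
        return 1
    fi
    # Only emptiness is checked; file naming is left to the conversion script.
    if [ -z "$(ls -A "$dir")" ]; then
        echo "DATASET_DIR is empty: $dir" >&2
        return 1
    fi
    echo "DATASET_DIR looks usable: $dir"
}
```

A typical call would be `check_dataset_dir "$DATASET_DIR" || exit 1` at the top of a wrapper script.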

Quick Start Scripts

| Script name | Description |
|-------------|-------------|
| `training_demo.sh` | Executes a short run using small batch sizes and a limited number of steps to demonstrate the training flow. |
| `training_1_epoch.sh` | Executes a test run that trains the model for 1 epoch and saves checkpoint files to an output directory. |
| `training_full.sh` | Trains the model using the full dataset and runs until convergence (90 epochs) and saves checkpoint files to an output directory. Note that this will take a considerable amount of time. |
| `multi_instance_training_demo.sh` | Uses mpirun to execute 2 processes with 1 process per socket with a batch size of 1 for 50 steps. |
| `multi_instance_training.sh` | Uses mpirun to execute 1 process per socket with a batch size of 1024 for the specified precision (fp32 or bfloat16 or bfloat32 or fp16). Checkpoint files and logs for each instance are saved to the output directory. |
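
Because the multi-instance scripts launch one process per socket, knowing the socket count of the machine indicates how many processes to expect. The sketch below assumes a Linux system where `lscpu` (from util-linux) is available; `socket_count` is a hypothetical helper, not part of the Model Zoo, and it reads `lscpu` output from standard input.

```bash
# Sketch: extract the number of CPU sockets from lscpu output.
# socket_count is a hypothetical helper; it expects lscpu text on stdin.
socket_count() {
    # lscpu prints a line such as "Socket(s):    2"; split on ":" and
    # strip the padding from the value field.
    awk -F: '$1 == "Socket(s)" { gsub(/[ \t]/, "", $2); print $2; exit }'
}
```

On a supported system this would be used as `lscpu | socket_count`.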

Run the model

Set up your environment using the instructions below, depending on whether you are using AI Kit.

To run using AI Kit you will need:

- numactl
- openmpi-bin (only required for multi-instance)
- openmpi-common (only required for multi-instance)
- openssh-client (only required for multi-instance)
- openssh-server (only required for multi-instance)
- libopenmpi-dev (only required for multi-instance)
- horovod==0.27.0 (only required for multi-instance)

Activate the tensorflow conda environment:
```
conda activate tensorflow
```

To run without AI Kit you will need:

- Python 3
- [intel-tensorflow>=2.5.0](https://pypi.org/project/intel-tensorflow/)
- git
- numactl
- openmpi-bin (only required for multi-instance)
- openmpi-common (only required for multi-instance)
- openssh-client (only required for multi-instance)
- openssh-server (only required for multi-instance)
- libopenmpi-dev (only required for multi-instance)
- horovod==0.27.0 (only required for multi-instance)
- A clone of the Model Zoo repo:

```
git clone https://github.com/IntelAI/models.git
```
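
The prerequisites listed above can be checked quickly before running anything. This is a minimal sketch, assuming that `python3` is the interpreter name on your system and that `mpirun` and `ssh` are the commands provided by the Open MPI and OpenSSH client packages; `missing_tools` is a hypothetical helper, not part of the Model Zoo.

```bash
# Sketch: report which prerequisite commands are missing from PATH.
# missing_tools is a hypothetical helper; it prints one missing command per line
# and prints nothing when every command is found.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Commands needed for single-instance runs (names are assumptions).
SINGLE_INSTANCE_TOOLS="python3 git numactl"
# Commands needed only for multi-instance runs (names are assumptions).
MULTI_INSTANCE_TOOLS="mpirun ssh"
```

For example, `missing_tools $SINGLE_INSTANCE_TOOLS` prints nothing when all single-instance tools are installed. Python packages such as intel-tensorflow and horovod are not covered by this check.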

Set the environment variables, navigate to your Model Zoo directory, and run a quickstart script. See the Quick Start Scripts section for details on the different options.

```
# cd to your model zoo directory
cd models

export DATASET_DIR=<path to the ImageNet TF records>
export OUTPUT_DIR=<path to the directory where log files and checkpoints will be written>
export PRECISION=<set the precision to "fp32" or "bfloat16" or "bfloat32" or "fp16">
# For a custom batch size, set env var `BATCH_SIZE` or it will run with a default value.
export BATCH_SIZE=<customized batch size value>

./quickstart/image_recognition/tensorflow/resnet50v1_5/training/cpu/<script name>.sh
```
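
A typo in one of these variables may only surface after the script has started, so a small pre-flight check can save time. The sketch below is an illustration under stated assumptions: `validate_precision` and `require_var` are hypothetical helpers, not part of the Model Zoo, and the accepted precision values are taken from the list above.

```bash
# Sketch: pre-flight checks for the environment variables used by the quickstart scripts.
# Both functions are hypothetical helpers, not part of the Model Zoo.

# Succeeds only for the precision values named in this document.
validate_precision() {
    case "$1" in
        fp32|bfloat16|bfloat32|fp16) return 0 ;;
        *) echo "Unsupported PRECISION: '$1'" >&2; return 1 ;;
    esac
}

# Succeeds only if the named variable is exported and non-empty.
require_var() {
    if [ -z "$(printenv "$1")" ]; then
        echo "Environment variable $1 is not exported or is empty" >&2
        return 1
    fi
}
```

One way to use them is `require_var DATASET_DIR && require_var OUTPUT_DIR && validate_precision "$PRECISION"` immediately before invoking the quickstart script.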

Additional Resources

To run more advanced use cases, see the instructions for the available precisions (FP32, BFloat16, FP16) for calling the `launch_benchmark.py` script directly.

To run the model using Docker, see the Intel® Developer Catalog workload container:
https://www.intel.com/content/www/us/en/developer/articles/containers/resnet50v1-5-fp32-training-tensorflow-container.html
