
TensorFlow BERT Large Training

This section has instructions for running BERT Large Training with the SQuAD dataset.

Set OUTPUT_DIR to point to the directory where all logs will be stored, and set PRECISION to the precision to train with: fp32, bfloat16, or fp16.
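As a minimal sketch (the values below are hypothetical placeholders, not values required by the scripts):

```shell
# Hypothetical example values -- substitute your own locations
export OUTPUT_DIR=/tmp/bert_large_logs
export PRECISION=bfloat16

# Make sure the log directory exists before launching training
mkdir -p "$OUTPUT_DIR"
```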

Datasets

SQuAD data

Download and unzip the BERT Large uncased (whole word masking) model from the Google BERT repo. Set DATASET_DIR to point to this directory when running BERT Large.

mkdir -p $DATASET_DIR && cd $DATASET_DIR
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -P wwm_uncased_L-24_H-1024_A-16
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P wwm_uncased_L-24_H-1024_A-16
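Before launching training, it can help to sanity-check that the downloads above landed where the scripts expect them. A sketch, assuming the standard contents of the model zip (`check_squad_dir` is a hypothetical helper, demonstrated here against a scratch directory):

```shell
# Verify that a directory contains the model and SQuAD files downloaded above.
check_squad_dir() {
  local d="$1" f
  for f in bert_config.json vocab.txt train-v1.1.json dev-v1.1.json; do
    [ -f "$d/$f" ] || { echo "missing: $d/$f"; return 1; }
  done
  echo "dataset layout OK"
}

# Demonstrate against a scratch directory standing in for
# $DATASET_DIR/wwm_uncased_L-24_H-1024_A-16 (empty placeholder files only).
demo=$(mktemp -d)
touch "$demo"/bert_config.json "$demo"/vocab.txt \
      "$demo"/train-v1.1.json "$demo"/dev-v1.1.json
check_squad_dir "$demo"
```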

Quick Start Scripts

| Script name | Description |
|-------------|-------------|
| `training_squad.sh` | Uses mpirun to execute 1 process per socket for BERT Large training with the specified precision (fp32, bfloat16, or fp16). Logs for each instance are saved to the output directory. |

TensorFlow BERT Large Pretraining

This section has instructions for running BERT Large Pretraining using Intel-optimized TensorFlow.

Datasets

SQuAD data

Download and unzip the BERT Large uncased (whole word masking) model from the Google BERT repo. Set DATASET_DIR to point to this directory when running BERT Large.

mkdir -p $DATASET_DIR && cd $DATASET_DIR
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P wwm_uncased_L-24_H-1024_A-16

Follow the instructions to generate the BERT pre-training dataset in TensorFlow record file format. The output TensorFlow record files are expected to be located in ${DATASET_DIR}/tf_records; for example, a TF record file path would be ${DATASET_DIR}/tf_records/part-00430-of-00500.
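A quick way to confirm the shards are in place is to count files matching the part-XXXXX-of-XXXXX pattern from the example path above (`count_shards` is a hypothetical helper, demonstrated here against a scratch directory):

```shell
# Count TF record shards named like part-00430-of-00500.
count_shards() { ls "$1"/part-*-of-* 2>/dev/null | wc -l; }

# Demonstrate with a scratch directory standing in for $DATASET_DIR/tf_records
# (empty placeholder files only).
demo=$(mktemp -d)
touch "$demo"/part-00000-of-00500 "$demo"/part-00001-of-00500
count_shards "$demo"
```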

Quick Start Scripts

| Script name | Description |
|-------------|-------------|
| `pretraining.sh` | Uses mpirun to execute 1 process per socket for BERT Large pretraining with the specified precision (fp32, bfloat16, or fp16). Logs for each instance are saved to the output directory. |

Run the model

Setup on baremetal

Set up your environment using the instructions below, depending on whether you are using AI Tools:


To run using AI Tools you will need:

  • numactl
  • unzip
  • wget
  • openmpi-bin (only required for multi-instance)
  • openmpi-common (only required for multi-instance)
  • openssh-client (only required for multi-instance)
  • openssh-server (only required for multi-instance)
  • libopenmpi-dev (only required for multi-instance)
  • horovod==0.27.0 (only required for multi-instance)
  • Activate the `tensorflow` conda environment
    conda activate tensorflow

To run without AI Tools you will need:

  • Python 3
  • intel-tensorflow>=2.5.0
  • git
  • numactl
  • openmpi-bin (only required for multi-instance)
  • openmpi-common (only required for multi-instance)
  • openssh-client (only required for multi-instance)
  • openssh-server (only required for multi-instance)
  • libopenmpi-dev (only required for multi-instance)
  • horovod==0.27.0 (only required for multi-instance)
  • A clone of the AI Reference Models repo
    git clone https://github.com/IntelAI/models.git

Download checkpoints:

wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_8/bert_large_checkpoints.zip
unzip bert_large_checkpoints.zip
export CHECKPOINT_DIR=$(pwd)/bert_large_checkpoints
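A minimal sketch of checking the result (`verify_ckpt_dir` is a hypothetical helper; it only confirms the directory exists and is non-empty):

```shell
# Check that a checkpoint directory exists and contains at least one file.
verify_ckpt_dir() { [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; }

# Demonstrate against a scratch directory standing in for $CHECKPOINT_DIR
# (the placeholder file is not a real checkpoint).
demo=$(mktemp -d)/bert_large_checkpoints
mkdir -p "$demo"
touch "$demo"/placeholder
verify_ckpt_dir "$demo" && echo "checkpoint dir OK"
```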

Run on Linux

Set environment variables to specify the dataset directory, precision to run, and an output directory.

# Navigate to the AI Reference Models directory
cd models

# Set the required environment vars
export PRECISION=<specify the precision to run: fp32, bfloat16 or fp16>
export DATASET_DIR=<path to the dataset>
export OUTPUT_DIR=<directory where log files will be written>
export CHECKPOINT_DIR=<path to the downloaded checkpoints folder>

# Run the pretraining.sh quickstart script
./quickstart/language_modeling/tensorflow/bert_large/training/cpu/pretraining.sh

Additional Resources