DLRM v2 Training

DLRM v2 Training best known configurations with Intel® Extension for PyTorch.

Model Information

| Use Case | Framework | Model Repo | Branch/Commit/Tag | Optional Patch |
|----------|-----------|------------|-------------------|----------------|
| Training | PyTorch | https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm | - | - |

Pre-Requisite

  • Host has 4 Intel® Data Center GPU Max Series cards, each with two tiles

  • Host has the latest Intel® Data Center GPU Max Series drivers installed: https://dgpu-docs.intel.com/driver/installation.html

  • The following Intel® oneAPI Base Toolkit components are required:

    • Intel® oneAPI DPC++ Compiler (DPCPPROOT denotes its installation path)
    • Intel® oneAPI Math Kernel Library (oneMKL) (MKLROOT denotes its installation path)
    • Intel® oneAPI MPI Library
    • Intel® oneAPI TBB Library

    Follow the instructions on the Intel® oneAPI Base Toolkit Download page to set up the package manager repository.

Prepare Dataset

After downloading and uncompressing the Criteo 1TB Click Logs dataset (24 files, day 0 through day 23), process the raw tsv files into the format required for training by running ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh with the necessary command line arguments.

Example usage:

bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
    ./criteo_1tb/raw_input_dataset_dir \
    ./criteo_1tb/temp_intermediate_files_dir \
    ./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir

The script requires 700GB of RAM and takes 1-2 days to run. We currently have features in development to reduce the preprocessing time and memory overhead. MD5 checksums of the expected final preprocessed dataset files are in md5sums_preprocessed_criteo_click_logs_dataset.txt.
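
Because preprocessing takes this long, it may be worth checking the output against the published checksums before training. The following is a minimal sketch, assuming the checksum file uses the `<md5>  <filename>` line layout printed by `md5sum`; the function names `md5_of_file` and `find_md5_mismatches` are hypothetical helpers, not part of the repository.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_md5_mismatches(checksum_file, dataset_dir):
    """Return names of files that are missing or whose digest differs.

    Assumption: each non-empty line of checksum_file looks like
    '<md5>  <filename>', as written by the md5sum tool.
    """
    mismatches = []
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' before the file name.
        name = name.lstrip("*").strip()
        target = Path(dataset_dir) / name
        if not target.is_file() or md5_of_file(target) != expected.lower():
            mismatches.append(name)
    return mismatches
```

Calling `find_md5_mismatches("md5sums_preprocessed_criteo_click_logs_dataset.txt", dataset_dir)` returns an empty list when every listed file is present and matches.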

The final dataset directory will look like this:

dataset_dir
|_day_0_dense.npy
|_day_0_labels.npy
|_day_0_sparse_multi_hot.npz

This folder is used as the DATASET_DIR parameter later.
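
A quick completeness check on that folder can catch an interrupted preprocessing run. This is a sketch only: it assumes that each of the 24 days (0 through 23) produces the same three files shown for day 0, which the listing above does not state explicitly. The helper name `missing_dataset_files` is hypothetical.

```python
from pathlib import Path

# File suffixes taken from the day_0 entries in the listing above.
SUFFIXES = ("dense.npy", "labels.npy", "sparse_multi_hot.npz")


def missing_dataset_files(dataset_dir, num_days=24):
    """List expected day_<n>_<suffix> files that are absent from dataset_dir.

    Assumption: every day 0..num_days-1 has the same three files as day 0.
    """
    root = Path(dataset_dir)
    expected = [
        f"day_{day}_{suffix}" for day in range(num_days) for suffix in SUFFIXES
    ]
    return [name for name in expected if not (root / name).is_file()]
```

An empty return value means every expected file was found; otherwise the list names what is missing.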

Training

  1. git clone https://github.com/IntelAI/models.git
  2. cd models/models_v2/pytorch/torchrec_dlrm/training/gpu
  3. Create virtual environment venv and activate it:
    python3 -m venv venv
    . ./venv/bin/activate
    
  4. Run setup.sh
    ./setup.sh
    
  5. Install the latest GPU versions of torch, torchvision and intel_extension_for_pytorch:
    python -m pip install torch==<torch_version> torchvision==<torchvision_version> intel-extension-for-pytorch==<ipex_version> --extra-index-url https://pytorch-extension.intel.com/release-whl-aitools/
  6. Set environment variables for Intel® oneAPI Base Toolkit. The default installation location {ONEAPI_ROOT} is /opt/intel/oneapi for the root account and ${HOME}/intel/oneapi for other accounts:
    source {ONEAPI_ROOT}/compiler/latest/env/vars.sh
    source {ONEAPI_ROOT}/mkl/latest/env/vars.sh
    source {ONEAPI_ROOT}/tbb/latest/env/vars.sh
    source {ONEAPI_ROOT}/mpi/latest/env/vars.sh
    source {ONEAPI_ROOT}/ccl/latest/env/vars.sh
  7. Set up the required environment parameters:

    | Parameter | export command |
    |-----------|----------------|
    | MULTI_TILE | export MULTI_TILE=True (True) |
    | PLATFORM | export PLATFORM=Max (Max) |
    | DATASET_DIR | export DATASET_DIR= |
    | BATCH_SIZE (optional) | export BATCH_SIZE=65536 |
    | PRECISION (optional) | export PRECISION=BF16 (BF16, FP32 and TF32 are supported for Max) |
    | OUTPUT_DIR (optional) | export OUTPUT_DIR=$PWD |
  8. Run run_model.sh
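
Before launching a long run, the parameters can be sanity-checked from Python. The sketch below is illustrative and does not reflect what run_model.sh does internally: it treats MULTI_TILE, PLATFORM and DATASET_DIR as required, accepts only the precisions listed for Max, and assumes the example values shown for the optional parameters act as defaults. `resolve_settings` is a hypothetical helper.

```python
import os

# Precisions listed as supported for Max in the parameter table.
SUPPORTED_PRECISIONS = ("BF16", "FP32", "TF32")
# Parameters not marked "(optional)" in the parameter table.
REQUIRED = ("MULTI_TILE", "PLATFORM", "DATASET_DIR")


def resolve_settings(env=None):
    """Validate the run parameters and fill in the optional ones.

    Assumption: the example values from the table serve as defaults for the
    optional parameters; run_model.sh may choose its defaults differently.
    """
    env = dict(os.environ) if env is None else dict(env)
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    precision = env.get("PRECISION", "BF16")
    if precision not in SUPPORTED_PRECISIONS:
        raise ValueError(f"unsupported PRECISION: {precision}")
    batch_size = int(env.get("BATCH_SIZE", "65536"))
    if batch_size <= 0:
        raise ValueError("BATCH_SIZE must be a positive integer")
    return {
        "MULTI_TILE": env["MULTI_TILE"],
        "PLATFORM": env["PLATFORM"],
        "DATASET_DIR": env["DATASET_DIR"],
        "BATCH_SIZE": batch_size,
        "PRECISION": precision,
        "OUTPUT_DIR": env.get("OUTPUT_DIR", os.getcwd()),
    }
```

Calling `resolve_settings()` with no arguments reads the current process environment, so it can be run in the same shell session after the export commands.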

Output

Multi-tile output will typically look like:

[7] 2024-01-10 22:22:36,284 - __main__ - INFO - avg training time per iter at ITER: 45, 0.07149526278177896 s
[7] 2024-01-10 22:22:36,284 - __main__ - INFO - Total number of iterations: 50
[2] 2024-01-10 22:22:36,292 - __main__ - INFO - avg training time per iter at ITER: 45, 0.07994737095303006 s
[2] 2024-01-10 22:22:36,292 - __main__ - INFO - Total number of iterations: 50
[0] 2024-01-10 22:22:36,293 - __main__ - INFO - avg training time per iter at ITER: 45, 0.08237933582729763 s
[0] 2024-01-10 22:22:36,294 - __main__ - INFO - Total number of iterations: 50
[1] 2024-01-10 22:22:36,296 - __main__ - INFO - avg training time per iter at ITER: 45, 0.08394240803188747 s
[1] 2024-01-10 22:22:36,296 - __main__ - INFO - Total number of iterations: 50
[6] 2024-01-10 22:22:36,304 - __main__ - INFO - avg training time per iter at ITER: 45, 0.09519488016764323 s
[6] 2024-01-10 22:22:36,304 - __main__ - INFO - Total number of iterations: 50
[5] 2024-01-10 22:22:36,306 - __main__ - INFO - avg training time per iter at ITER: 45, 0.09369233979119194 s
[5] 2024-01-10 22:22:36,306 - __main__ - INFO - Total number of iterations: 50
[3] 2024-01-10 22:22:36,309 - __main__ - INFO - avg training time per iter at ITER: 45, 0.09690533743964301 s
[3] 2024-01-10 22:22:36,309 - __main__ - INFO - Total number of iterations: 50
[4] 2024-01-10 22:22:36,339 - __main__ - INFO - avg training time per iter at ITER: 45, 0.11025158564249675 s
[4] 2024-01-10 22:22:36,339 - __main__ - INFO - Total number of iterations: 50
[0] 2024:01:10-22:22:37:(38583) |CCL_INFO| finalize atl-mpi
[0] 2024:01:10-22:22:37:(38583) |CCL_INFO| finalized atl-mpi
[3] 2024:01:10-22:22:37:(38586) |CCL_INFO| finalizing level-zero
[4] 2024:01:10-22:22:37:(38587) |CCL_INFO| finalizing level-zero
[3] 2024:01:10-22:22:37:(38586) |CCL_INFO| finalized level-zero
[2] 2024:01:10-22:22:37:(38585) |CCL_INFO| finalizing level-zero
[6] 2024:01:10-22:22:37:(38589) |CCL_INFO| finalizing level-zero
[0] 2024:01:10-22:22:37:(38583) |CCL_INFO| finalizing level-zero
[7] 2024:01:10-22:22:37:(38590) |CCL_INFO| finalizing level-zero
[4] 2024:01:10-22:22:37:(38587) |CCL_INFO| finalized level-zero
[5] 2024:01:10-22:22:37:(38588) |CCL_INFO| finalizing level-zero
[2] 2024:01:10-22:22:37:(38585) |CCL_INFO| finalized level-zero
[6] 2024:01:10-22:22:37:(38589) |CCL_INFO| finalized level-zero
[0] 2024:01:10-22:22:37:(38583) |CCL_INFO| finalized level-zero
[1] 2024:01:10-22:22:37:(38584) |CCL_INFO| finalizing level-zero
[7] 2024:01:10-22:22:37:(38590) |CCL_INFO| finalized level-zero
[5] 2024:01:10-22:22:37:(38588) |CCL_INFO| finalized level-zero
[1] 2024:01:10-22:22:37:(38584) |CCL_INFO| finalized level-zero

Final results of the training run can be found in the results.yaml file.

results:
 - key: throughput
   value: 594422.29
   unit: samples/s
 - key: accuracy
   value: None
   unit: AUROC
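
To relate the per-rank log lines to the throughput figure, the timings can be pulled out of the log with a short script. This is a sketch under stated assumptions: the regular expression is written against the sample lines above, and the throughput formula (global batch size divided by the slowest rank's average iteration time) is an inference from the sample numbers, not a documented description of how run_model.sh computes the value. With the default BATCH_SIZE of 65536 and rank 4's time from the sample, the formula lands very close to the 594422.29 samples/s reported in results.yaml.

```python
import re

# Matches lines such as:
# [7] 2024-01-10 22:22:36,284 - __main__ - INFO - avg training time per iter at ITER: 45, 0.0714 s
ITER_TIME = re.compile(
    r"^\[(?P<rank>\d+)\].*avg training time per iter at ITER: \d+, "
    r"(?P<seconds>[0-9.eE+-]+) s"
)


def per_rank_iter_time(log_text):
    """Map each rank to its last reported average seconds per iteration."""
    times = {}
    for line in log_text.splitlines():
        match = ITER_TIME.match(line.strip())
        if match:
            times[int(match.group("rank"))] = float(match.group("seconds"))
    return times


def estimated_throughput(times, global_batch_size=65536):
    """Samples per second, assuming the slowest rank bounds each iteration."""
    return global_batch_size / max(times.values())
```

Passing the captured console output to `per_rank_iter_time` and the result to `estimated_throughput` gives a figure that can be compared with the throughput entry in results.yaml.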