Sparse model Step-by-Step

This example prunes a DistilBERT base model with group lasso during a distillation process to obtain a sparse model, then runs inference with the Transformers-accelerated Library, a high-performance operator computing library, to get an overall performance improvement.

Prerequisite

1. Installation

1.1 Install python environment

Create a new python environment:

conda create -n <env name> python=3.8
conda activate <env name>

Check the gcc version with gcc -v and make sure it is higher than 7.0; if not, you need to update gcc yourself. Make sure the cmake version is 3.x rather than 2.x; if not, install cmake. Make sure autoconf is installed; if not, install it.

gcc -v
cmake --version
conda install cmake
sudo apt install autoconf

Install Intel Extension for Transformers from Source Code

cd <intel_extension_for_transformers_folder>
git submodule update --init --recursive
python setup.py install
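
After the build finishes, a quick import check can verify the installation (assuming the package is importable as intel_extension_for_transformers):

python -c "import intel_extension_for_transformers"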

Install packages for the examples

cd <intel_extension_for_transformers_folder>/examples/deployment/neural_engine/sparse/distilbert_base_uncased
pip install -r requirements.txt

Note: It is recommended to install protobuf <= 3.20.0 if you use onnxruntime <= 1.11.
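
For example, the pin can be applied with pip:

pip install "protobuf<=3.20.0"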

1.2 Environment variables

Preloading libjemalloc.so can improve performance when running multiple instances.

export LD_PRELOAD=<intel_extension_for_transformers_folder>/intel_extension_for_transformers/backends/neural_engine/executor/third_party/jemalloc/lib/libjemalloc.so

Using weight sharing can save memory and improve performance when running multiple instances.

export WEIGHT_SHARING=1
export INST_NUM=<inst num>
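
For example, a hypothetical setup for four instances (the instance count is a placeholder; adapt it and the jemalloc path to your machine):

export LD_PRELOAD=<intel_extension_for_transformers_folder>/intel_extension_for_transformers/backends/neural_engine/executor/third_party/jemalloc/lib/libjemalloc.so
export WEIGHT_SHARING=1
export INST_NUM=4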

2. Prepare Dataset and pretrained model

2.1 Get dataset

python prepare_dataset.py --dataset_name=squad --output_dir=./data

2.2 Get sparse model

Use the sparse model we published on Hugging Face, a DistilBERT base model on SQuADv1.1 with an 80% sparse ratio on 1X4 blocks (it includes the INT8 ONNX model and the INT8 Neural Engine IR). You can also get the INT8 ONNX sparse model from the optimization module by setting precision=int8, with the following command:

bash prepare_model.sh --input_model=Intel/distilbert-base-uncased-squadv1.1-sparse-80-1X4-block --dataset_name=squad --task_name=squad --output_dir=./model_and_tokenizer --precision=int8

Then you can generate a transposed sparse model for better performance, with the following command:

python export_transpose_ir.py --input_model=./model_and_tokenizer/int8-model.onnx
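
The benchmark commands below load the IR from ./sparse_int8_ir (a conf.yaml plus a model.bin). Assuming the transpose step above writes its output to that directory, a quick listing can confirm the files are in place; adjust the path if your output location differs:

ls ./sparse_int8_ir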

Benchmark

Neural Engine automatically detects the structured sparse ratio of the weights. As long as it is above 70% (performance gains are normally seen when the sparse ratio exceeds 70%), Neural Engine calls the Transformers-accelerated Libraries and the high-performance LayerNorm op with transpose mode to improve inference performance.

3.1 Accuracy

Run with Python:

GLOG_minloglevel=2 python run_executor.py --input_model=./sparse_int8_ir  --tokenizer_dir=./model_and_tokenizer --mode=accuracy --data_dir=./data --batch_size=8

or run the shell script:

bash run_benchmark.sh --input_model=./sparse_int8_ir  --tokenizer_dir=./model_and_tokenizer --mode=accuracy --data_dir=./data --batch_size=8

3.2 Performance

Run with Python:

GLOG_minloglevel=2 python run_executor.py --input_model=./sparse_int8_ir --mode=performance --batch_size=8 --seq_len=128

or run the shell script:

bash run_benchmark.sh --input_model=./sparse_int8_ir  --mode=performance --batch_size=8 --seq_len=128

or run the C++ executable. The warmup value below is recommended to be 1/10 of the iterations and no less than 3.

export GLOG_minloglevel=2
export OMP_NUM_THREADS=<cpu_cores>
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX
export UNIFIED_BUFFER=1
numactl -C 0-<cpu_cores-1> <intel_extension_for_transformers_folder>/intel_extension_for_transformers/backends/neural_engine/bin/neural_engine \
  --batch_size=<batch_size> --iterations=<iterations> --w=<warmup> \
  --seq_len=128 --config=./sparse_int8_ir/conf.yaml --weight=./sparse_int8_ir/model.bin
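
For example, on a hypothetical machine with 56 cores, running 100 iterations with a warmup of 10 (1/10 of the iterations) at batch size 8 (all values are illustrative; substitute your own):

# example values only: 56 cores, batch size 8, 100 iterations, warmup 10
numactl -C 0-55 <intel_extension_for_transformers_folder>/intel_extension_for_transformers/backends/neural_engine/bin/neural_engine \
  --batch_size=8 --iterations=100 --w=10 \
  --seq_len=128 --config=./sparse_int8_ir/conf.yaml --weight=./sparse_int8_ir/model.bin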