The wav2vec 2.0 model is large (92M parameters) and resource-intensive to serve. The goal of this project is to reduce the model size, speed up inference, and deploy the model on a cloud platform as a speech-to-text service.
Slide: Slide
Architecture Image
- Implemented knowledge distillation (reducing the model from 92M to 52M parameters) and converted the model to ONNX, achieving roughly 2x faster inference while retaining moderate accuracy.
- Used Triton Inference Server as the backend to host the model on GCP. The setup involved storing the Docker image in Container Registry and the model in Cloud Storage, creating an Instance Group from an Instance Template, configuring Load Balancing, and enabling Autoscaling.
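For the distillation step listed above, one common formulation is to train the student to match the teacher's temperature-softened output distribution. The sketch below is only an illustration of that idea in plain Python: this README does not state which loss was actually used, and the function names and the default temperature are assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: the actual training loss of this project is not documented here.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution (target)
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits for a single frame over a 3-symbol vocabulary (made-up numbers).
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In a real training loop it would typically be computed per frame on tensors and combined with the usual supervised loss.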
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.05-py3 tritonserver --model-repository=/models
docker build -t triton-client:v1 ./client
docker run -it --net=host -v ${PWD}:/workspace/ triton-client:v1
python client.py
# for performance analysis
perf_analyzer -m wav2vec -u 34.160.133.47:80 --concurrency-range 1:4 --shape input:1,8000
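The contents of `client.py` are not shown here. As a rough illustration of what a request to the server looks like, the sketch below builds a JSON body for Triton's HTTP inference endpoint (`POST /v2/models/<model>/infer`). The model name `wav2vec`, the input name `input`, and the shape `1,8000` are taken from the `perf_analyzer` command above; the `FP32` datatype, the host, and the helper names are assumptions for the example.

```python
import json
import urllib.request

def build_infer_request(samples, input_name="input"):
    """Build a request body with one [1, N] input tensor (datatype FP32 is assumed)."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(samples)],
                "datatype": "FP32",
                "data": [float(s) for s in samples],  # flattened, row-major
            }
        ]
    }

def infer_url(host, model_name="wav2vec"):
    """URL of the HTTP inference endpoint for a given model."""
    return f"http://{host}/v2/models/{model_name}/infer"

def send(host, samples):
    """POST the request and return the decoded JSON response (needs a running server)."""
    payload = json.dumps(build_infer_request(samples)).encode("utf-8")
    request = urllib.request.Request(
        infer_url(host),
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# 8000 samples of silence, matching the shape used in the perf_analyzer command.
body = build_infer_request([0.0] * 8000)
payload = json.dumps(body).encode("utf-8")
```

With the local container from the first command running, `send("localhost:8000", [0.0] * 8000)` would submit the request over the HTTP port. The names and layout of the returned outputs depend on the model configuration, which is not shown here.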
Server performance test
4 CPU
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 3.26647 infer/sec, latency 303771 usec
Concurrency: 2, throughput: 6.44571 infer/sec, latency 308146 usec
Concurrency: 3, throughput: 9.8324 infer/sec, latency 304842 usec
Concurrency: 4, throughput: 12.693 infer/sec, latency 314371 usec
1 GPU Tesla T4
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 4.05259 infer/sec, latency 247329 usec
Concurrency: 2, throughput: 8.1353 infer/sec, latency 246981 usec
Concurrency: 3, throughput: 12.1179 infer/sec, latency 247838 usec
Concurrency: 4, throughput: 16.1339 infer/sec, latency 248269 usec
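To help read the numbers above, the following sketch recomputes two quantities from the reported figures: the throughput implied by the latency (concurrency divided by latency in seconds, which should roughly match the measured value when requests are served back to back) and the GPU-over-CPU throughput ratio. The values are copied from the results above; the variable and function names are chosen for the example.

```python
# (concurrency, throughput in infer/sec, latency in usec), copied from the results above.
CPU_4_CORES = [
    (1, 3.26647, 303771),
    (2, 6.44571, 308146),
    (3, 9.8324, 304842),
    (4, 12.693, 314371),
]
GPU_T4 = [
    (1, 4.05259, 247329),
    (2, 8.1353, 246981),
    (3, 12.1179, 247838),
    (4, 16.1339, 248269),
]

def implied_throughput(concurrency, latency_usec):
    """Throughput expected if each of `concurrency` clients always has one request in flight."""
    return concurrency / (latency_usec / 1_000_000)

def speedups(cpu_rows, gpu_rows):
    """GPU throughput divided by CPU throughput at each concurrency level."""
    return [gpu[1] / cpu[1] for cpu, gpu in zip(cpu_rows, gpu_rows)]

for name, rows in (("4 CPU", CPU_4_CORES), ("1 GPU Tesla T4", GPU_T4)):
    for concurrency, throughput, latency in rows:
        expected = implied_throughput(concurrency, latency)
        print(f"{name}, concurrency {concurrency}: measured {throughput:.2f}, implied {expected:.2f} infer/sec")

gpu_over_cpu = speedups(CPU_4_CORES, GPU_T4)
print("GPU/CPU throughput ratio:", [round(r, 2) for r in gpu_over_cpu])
```

On these figures the measured and implied throughputs agree to within about 1%, and the T4 delivers roughly 1.2x to 1.3x the throughput of the 4-CPU instance. In both setups latency stays nearly flat while throughput grows almost linearly with concurrency, so the server does not appear saturated at 4 concurrent requests.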
Authors: Huy Nguyen
- Github: Huy1711
- Email: nguyenduchuy1711@gmail.com
Advisors: Ba Ngoc