Recently I implemented a Rust + onnxruntime inference library for Parakeet-tdt-0.6B v2. onnxruntime supports various hardware backends (execution providers), including CUDA, OpenVINO, and TensorRT, which can significantly improve inference performance. However, when I tested the library on my MacBook Pro with an M4 processor, inference was not as fast as expected, especially compared to the MLX implementation of Parakeet.
In my observation, the CoreML backend of onnxruntime is not well optimized: it cannot even compete with the CPU backend. And onnxruntime has no Metal (MPS) backend, which is what the MLX implementation of Parakeet uses. Since my goal is a cross-platform inference library, I prefer an inference engine with broad hardware support.
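For context, backend selection in onnxruntime happens per session via execution providers, and anything an appended provider cannot handle falls back to the default CPU provider. A minimal C++ sketch of what that looks like (my library uses the Rust bindings, and `encoder.onnx` is a placeholder path, not a file from my project):

```cpp
#include <onnxruntime_cxx_api.h>
#include <coreml_provider_factory.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "parakeet");
    Ort::SessionOptions opts;

    // Append the CoreML execution provider with default flags; any op it
    // cannot place is executed by the built-in CPU provider instead.
    Ort::ThrowOnError(
        OrtSessionOptionsAppendExecutionProvider_CoreML(opts, COREML_FLAG_USE_NONE));

    // Placeholder model path, for illustration only.
    Ort::Session session(env, "encoder.onnx", opts);
    return 0;
}
```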
Therefore, I turned my attention to GGML, the tensor engine behind llama.cpp. In the ASR area, the whisper.cpp project is built on top of GGML and performs well. During my investigation I also found a feature request in the whisper.cpp project for a Parakeet implementation on GGML, so I decided to spend some time taking on this challenge.
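For readers unfamiliar with GGML: you allocate tensors from a caller-provided memory pool, record operations into a static compute graph, and then execute the graph on a chosen backend. A minimal CPU-only sketch, independent of parakeet.cpp (note that in newer ggml revisions `ggml_graph_compute_with_ctx` is declared in `ggml-cpu.h`, in older ones in `ggml.h`):

```cpp
#include "ggml.h"
#include "ggml-cpu.h" // older revisions: ggml_graph_compute_with_ctx is in ggml.h
#include <cstdio>

int main() {
    // All tensors live in one caller-provided pool; ggml does no hidden allocation.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Declare tensors; shapes are [ne0, ne1] = [columns, rows] in ggml's convention.
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 16);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8);

    // Fill inputs directly; with no_alloc=false the data pointers are valid.
    for (int i = 0; i < 64*16; ++i) ((float *) a->data)[i] = 1.0f;
    for (int i = 0; i < 64*8;  ++i) ((float *) b->data)[i] = 2.0f;

    // Record the op into a graph; nothing is computed yet.
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // result shape [16, 8]

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 4);

    printf("c[0,0] = %f\n", ((float *) c->data)[0]); // 64 * (1 * 2) = 128

    ggml_free(ctx);
    return 0;
}
```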
Currently, the encoder part is done. I can verify its output tensor against the parakeet-rs library.
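Verification here means a numerical comparison against a reference dump of the same tensor. A sketch of the kind of check I mean (the helper name and tolerance are illustrative, not parakeet.cpp's actual test code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Compare a ggml output against a reference dump (e.g. produced with
// parakeet-rs) element by element and report the worst deviation.
bool tensors_close(const std::vector<float> & got,
                   const std::vector<float> & ref,
                   float atol = 1e-4f) {
    if (got.size() != ref.size()) return false;
    float max_diff = 0.0f;
    for (size_t i = 0; i < got.size(); ++i) {
        max_diff = std::max(max_diff, std::fabs(got[i] - ref[i]));
    }
    std::printf("max abs diff: %g\n", max_diff);
    return max_diff <= atol;
}
```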
However, I found the GGML implementation is not as efficient as expected, and the size of the gap surprised me. For my test audio, the encoder inference time in parakeet-mlx is 0.001s, while in GGML it is around 1s; the onnxruntime CPU backend takes around 1.5s. So I decided to pause here and ask for review from the public.
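For reference, the microsecond figures in the logs further down are wall-clock timings around the graph run. A sketch of how such a measurement can be taken (this is not parakeet.cpp's actual timer; the callable stands in for the graph execution):

```cpp
#include <chrono>

// Run a callable once and return the elapsed wall-clock time in
// microseconds, matching the "took N microseconds" log style below.
template <typename F>
long long time_us(F && f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```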
To reproduce the encoder run:

- Set up a Python environment for running parakeet-mlx, since the GGUF model is exported from that project.
- Download the Parakeet-tdt-0.6B v2 model from Hugging Face. Note that it must be the MLX version of the model.
- Clone this repository:

```sh
git clone https://github.com/jason-ni/parakeet.cpp.git
cd parakeet.cpp
```

- Export the GGUF model using my script (a sketch for inspecting the exported file follows the expected output below):

```sh
python scripts/generate_parakeet_gguf.py
```

- Build the ggml project:

```sh
mkdir build
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
```

- Generate the audio features tensor data:

```sh
python scripts/generate_input_data.py
```

- Run the inference:

```sh
./build/bin/parakeet_cpp parakeet-tdt-0.6b-v2-float32.gguf assets/pe.bin input.data
```

The expected output looks like:

```
/Users/jason/prj/parakeet.cpp/src/framework.cpp:630:<run> run_schedule took 924304 microseconds
/Users/jason/prj/parakeet.cpp/bin/main.cpp:214:<operator()> tensor data: [
[ 0.011379, 0.009602, 0.055848, ... 0.079716, -0.007456, 0.065984, ]
[ 0.018310, 0.006686, 0.035923, ... -0.065628, -0.012651, 0.051945, ]
[ 0.020104, -0.010873, 0.033619, ... -0.060412, -0.025365, 0.054132, ]
...
[-0.086789, 0.040847, 0.009242, ... -0.005162, -0.032376, -0.023982, ]
[-0.067029, 0.009823, 0.035980, ... -0.018739, -0.033858, 0.008342, ]
[-0.071880, -0.017643, 0.110965, ... -0.091322, -0.004971, -0.038205, ]
],
shape: [1024, 597, 1, 1], type: f32
/Users/jason/prj/parakeet.cpp/bin/main.cpp:215:<operator()> output tensor buft: Metal
/Users/jason/prj/parakeet.cpp/bin/main.cpp:214:<operator()> tensor data: [
[-0.784943, 0.619568, 0.861500, ... 0.998092, 0.060645, 0.998160, ]
[-0.945455, -0.325753, 0.056010, ... 0.998098, 0.060543, 0.998166, ]
[-0.236720, -0.971578, -0.799305, ... 0.998105, 0.060441, 0.998172, ]
...
[ 0.236720, -0.971578, 0.799305, ... 0.998105, -0.060441, 0.998172, ]
[ 0.945455, -0.325753, -0.056010, ... 0.998098, -0.060543, 0.998166, ]
[ 0.784943, 0.619568, -0.861500, ... 0.998092, -0.060645, 0.998160, ]
],
shape: [1024, 1193, 1, 1], type: f32
/Users/jason/prj/parakeet.cpp/bin/main.cpp:215:<operator()> output tensor buft: Metal
duration: 1691640
```
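As referenced in the export step above, the exported model is a plain GGUF file, so it can be inspected with ggml's gguf API. A minimal sketch that just lists the tensor names written by `scripts/generate_parakeet_gguf.py` (in older ggml revisions the gguf API lives in `ggml.h` instead of a separate header):

```cpp
#include "ggml.h"
#include "gguf.h" // older revisions: gguf API is declared in ggml.h
#include <cstdio>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "parakeet-tdt-0.6b-v2-float32.gguf";

    // Load metadata and tensor info only; no_alloc=true skips weight data.
    struct ggml_context * ctx = NULL;
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ &ctx };
    struct gguf_context * gctx = gguf_init_from_file(path, params);
    if (!gctx) {
        fprintf(stderr, "failed to load %s\n", path);
        return 1;
    }

    const int64_t n = gguf_get_n_tensors(gctx);
    printf("%s: %lld tensors\n", path, (long long) n);
    for (int64_t i = 0; i < n; ++i) {
        printf("  %s\n", gguf_get_tensor_name(gctx, i));
    }

    gguf_free(gctx);
    ggml_free(ctx);
    return 0;
}
```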