<a href="https://colab.research.google.com/github/k2-fsa/colab/blob/master/rtf_test_for_zipformer_transducer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction


This colab notebook tests the RTF ([Real-time factor](https://openvoice-tech.net/index.php/Real-time-factor))
of the [Zipformer][zipformer] transducer model from [icefall][icefall].

For the CPU test, we use [sherpa-onnx][sherpa-onnx].

For the GPU test, we infer the RTF from the decoding logs for the LibriSpeech test-clean and test-other datasets.
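As a reminder of what is being measured, the RTF is the time spent processing divided by the duration of the audio processed, so values below 1 mean faster than real time. The snippet below is only a minimal sketch of that ratio; the function name `real_time_factor` and the example numbers are illustrative assumptions, not part of icefall or sherpa-onnx.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (both in seconds)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical example: 0.5 s of compute for a 10 s recording.
rtf = real_time_factor(0.5, 10.0)
print(f"RTF = {rtf:.3f}")  # RTF = 0.050
```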

[icefall]: https://github.com/k2-fsa/icefall

[sherpa-onnx]: https://github.com/k2-fsa/sherpa-onnx

[zipformer]: https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer



The results are summarized in the following tables:


## CPU RTF

Measured on the CPU of this colab notebook:

|model type| weight type|# parameters (M)| RTF|
|---|---|---|---|
|non-streaming|**float32**| [65.55](https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#non-streaming-1)|0.064|
|non-streaming|**int8**|[65.55](https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#non-streaming-1)|0.056|
|streaming|**float32**|[66.11](https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#normal-scaled-model-number-of-model-parameters-66110931-ie-6611-m)|0.18|
|streaming|**int8**|[66.11](https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#normal-scaled-model-number-of-model-parameters-66110931-ie-6611-m)|0.13|

## GPU RTF

Measured on a V100 GPU with 32 GB of memory:

|model type| dataset | RTF|
|---|---|---|
|non-streaming| test-clean |0.00298|
|non-streaming| test-other |0.00321|
|streaming| test-clean |0.00895|
|streaming| test-other |0.01111|

# Display CPU information

In [1]:
%%shell

lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  1
    Socket(s):           1
    Stepping:            0
    BogoMIPS:            4399.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
                         a cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscal
                         l nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopo
                         logy nonstop_tsc cpuid tsc_known_freq pni pclmulqdq sss
                         e3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes 
                         xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowpref



# Install sherpa-onnx

In [2]:
! pip install sherpa-onnx

Collecting sherpa-onnx
  Downloading sherpa_onnx-1.7.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.8/13.8 MB 31.5 MB/s eta 0:00:00
Collecting sentencepiece==0.1.96 (from sherpa-onnx)
  Downloading sentencepiece-0.1.96-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 41.8 MB/s eta 0:00:00
Installing collected packages: sentencepiece, sherpa-onnx
Successfully installed sentencepiece-0.1.96 sherpa-onnx-1.7.9


# Download the non-streaming zipformer pre-trained model

In [3]:
%%shell

# Please see https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-transducer/zipformer-transducer-models.html#csukuangfj-sherpa-onnx-zipformer-en-2023-06-26-english

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/sherpa-onnx-zipformer-en-2023-06-26
cd sherpa-onnx-zipformer-en-2023-06-26
git lfs pull --include "*.onnx"

Cloning into 'sherpa-onnx-zipformer-en-2023-06-26'...
remote: Enumerating objects: 20, done.
remote: Total 20 (delta 0), reused 0 (delta 0), pack-reused 20
Unpacking objects: 100% (20/20), 666.78 KiB | 6.67 MiB/s, done.




## float32 model test

In [4]:
%%shell

sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
  --num-threads=1 \
  --decoding-method=greedy_search \
  ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx --num-threads=1 --decoding-method=greedy_search ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whis
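An RTF for a single wave file can also be estimated by hand: time the decoding call and divide by the length of the audio. The following is a rough sketch using only the Python standard library. The `decode` argument is a hypothetical callable standing in for whatever runs the recognizer (it is not a sherpa-onnx API), and the measured time includes everything that callable does, such as model loading, if it does that.

```python
import time
import wave
from typing import Callable


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM wave file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()


def measure_rtf(decode: Callable[[str], object], wav_path: str) -> float:
    """Time decode(wav_path) and divide by the audio duration."""
    start = time.perf_counter()
    decode(wav_path)  # hypothetical: any function that decodes the file
    elapsed = time.perf_counter() - start
    return elapsed / wav_duration_seconds(wav_path)
```

A single run on a shared colab CPU can be noisy, so averaging several runs is likely to give a steadier number.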



## int8 model test

In [5]:
%%shell
sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.int8.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx \
  --num-threads=1 \
  --decoding-method=greedy_search \
  ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.int8.onnx --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx --num-threads=1 --decoding-method=greedy_search ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.int8.onnx", joiner_filename="./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDec



# Download the streaming zipformer pre-trained model

In [6]:
%%shell

# Please see
# https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#csukuangfj-sherpa-onnx-streaming-zipformer-en-2023-06-26-english

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26
cd sherpa-onnx-streaming-zipformer-en-2023-06-26
git lfs pull --include "*.onnx"

Cloning into 'sherpa-onnx-streaming-zipformer-en-2023-06-26'...
remote: Enumerating objects: 27, done.
remote: Total 27 (delta 0), reused 0 (delta 0), pack-reused 27
Unpacking objects: 100% (27/27), 667.63 KiB | 6.13 MiB/s, done.




## float32 model test

In [7]:
%%shell
sherpa-onnx \
  --tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
  --decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
  --joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
  ./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/1.wav

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx --decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx --joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx ./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/1.wav 

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.onnx", decoder="./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.onnx", joiner="./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.onnx"), paraformer=O



## int8 model test

In [8]:
%%shell
sherpa-onnx \
  --tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx \
  --decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx \
  --joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx \
  ./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/1.wav

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx --tokens=./sherpa-onnx-streaming-zipformer-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx --decoder=./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx --joiner=./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx ./sherpa-onnx-streaming-zipformer-en-2023-06-26/test_wavs/1.wav 

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="./sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx", decoder="./sherpa-onnx-streaming-zipformer-en-2023-06-26/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx", joiner="./sherpa-onnx-streaming-zipformer-en-2023-06-26/joiner-epoch-99-avg-1-chunk-16-left



# RTF on GPU

If you are curious about the RTF on GPU, we can compute a rough number from the decoding logs.

## non-streaming

For instance, for the non-streaming model used above, the decoding logs for the LibriSpeech test-clean and test-other datasets can be found at
https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15/blob/main/decoding_result/greedy_search/log-decode-epoch-30-avg-9-context-2-max-sym-per-frame-1-use-averaged-model-2023-05-15-19-37-19

From the logs, we can get the following data:

||start | end| duration (s)|
|---|---|---|---|
|test-clean|19:37:39|19:38:08|29|
|test-other|19:38:10|19:38:36|26|
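The durations in the table are the differences between the start and end timestamps. A small sketch of that subtraction is shown below; the helper name `elapsed_seconds` is an assumption for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


print(elapsed_seconds("19:37:39", "19:38:08"))  # 29 (test-clean)
print(elapsed_seconds("19:38:10", "19:38:36"))  # 26 (test-other)
```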

The test-clean dataset has 2 hours 42 minutes of data, while the test-other dataset has 2 hours 15 minutes of data.

So the RTF for test-clean is:
```
29 / (2 hours 42 minutes) = 29 / 9720 = 0.00298
```

The RTF for test-other is:
```
26 / (2 hours 15 minutes) = 26 / 8100 = 0.00321
```
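The same arithmetic can be checked in a few lines of Python. This is only a sketch; the helper `to_seconds` and the variable names are made up for illustration, while the numbers are the ones quoted above.

```python
def to_seconds(hours: int, minutes: int) -> int:
    """Convert a duration given in hours and minutes to seconds."""
    return hours * 3600 + minutes * 60


test_clean_audio = to_seconds(2, 42)  # 9720 s
test_other_audio = to_seconds(2, 15)  # 8100 s

# Decoding time in seconds divided by audio duration in seconds.
print(f"{29 / test_clean_audio:.5f}")  # 0.00298
print(f"{26 / test_other_audio:.5f}")  # 0.00321
```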

Note that the GPU used during decoding is a V100 with 32 GB of memory.

## streaming

For the streaming model, the decoding logs can be found at
https://huggingface.co/Zengwei/icefall-asr-librispeech-streaming-zipformer-2023-05-17/blob/main/decoding_results/streaming/greedy_seearch/log-decode-epoch-30-avg-8-chunk-16-left-context-128-use-averaged-model-2023-05-12-20-46-34

From the logs, we can get the following data:

||start | end| duration (s)|
|---|---|---|---|
|test-clean|20:46:41|20:48:08|87|
|test-other|20:48:08|20:49:38|90|

So the RTF for test-clean is:
```
87 / (2 hours 42 minutes) = 87 / 9720 = 0.00895
```

The RTF for test-other is:
```
90 / (2 hours 15 minutes) = 90 / 8100 = 0.01111
```

Note that the GPU used during decoding is a V100 with 32 GB of memory.

---


The RTFs are summarized in the following table:

|model type| dataset | RTF|
|---|---|---|
|non-streaming| test-clean |0.00298|
|non-streaming| test-other |0.00321|
|streaming| test-clean |0.00895|
|streaming| test-other |0.01111|
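For reference, the rows of this table can be regenerated from the four decoding times and the two dataset durations used in this section. The sketch below assumes the dictionary and function names shown (they are illustrative only) and simply formats each ratio to five decimal places.

```python
# Audio duration of each dataset in seconds (2 h 42 min and 2 h 15 min).
AUDIO_SECONDS = {"test-clean": 9720, "test-other": 8100}

# Decoding time in seconds, read from the logs referenced in this section.
DECODE_SECONDS = {
    ("non-streaming", "test-clean"): 29,
    ("non-streaming", "test-other"): 26,
    ("streaming", "test-clean"): 87,
    ("streaming", "test-other"): 90,
}


def rtf_rows() -> list:
    """Build one markdown table row per (model type, dataset) pair."""
    rows = []
    for (model, dataset), secs in DECODE_SECONDS.items():
        rtf = secs / AUDIO_SECONDS[dataset]
        rows.append(f"|{model}| {dataset} |{rtf:.5f}|")
    return rows


print("|model type| dataset | RTF|")
print("|---|---|---|")
print("\n".join(rtf_rows()))
```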

**Note**: The LibriSpeech test datasets were decoded using PyTorch on a GPU.

