The Intel-optimized Speech-to-Text Analytics System performs voice-to-text transcription, diarization, and speaker-aware transcription. Downstream analytics, such as listen-to-speak ratio, average talking speed, longest customer monologue, speaker switches per conversation, average pause time, and keyword spotting, are computed from the speaker-aware transcription.
Once input audio is fed to the system, the diarization pipeline generates speaker-level timestamps (the time span of each speaker utterance) and the transcription pipeline generates word-level timestamps (the time span of each word). These two outputs are combined to produce the speaker-aware transcription, from which all associated analytics are computed.
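The merge step can be sketched as follows. This is a minimal illustration, not the repo's actual implementation; it assumes diarization yields `(start, end, speaker)` tuples and transcription yields `(start, end, word)` tuples, and assigns each word to the speaker segment it overlaps most.

```python
def speaker_aware_merge(speaker_segments, words):
    """Assign each transcribed word to the speaker whose segment overlaps it most."""
    result = []
    for w_start, w_end, word in words:
        best_speaker, best_overlap = None, 0.0
        for s_start, s_end, speaker in speaker_segments:
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        result.append((best_speaker, word))
    return result

segments = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [(0.1, 0.5, "hello"), (2.2, 2.6, "hi")]
print(speaker_aware_merge(segments, words))
```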
Note: Topic Modeling and Sentiment Analysis are work in progress.
- Quick Start
- Transcription Models
- Model Serving and Rest APIs
- User Interface
- Performance Tuning
- Live Transcription
- Supported Dtypes
- References
Create a virtual environment.
conda create -n asr_pipeline python=3.10 -y
conda activate asr_pipeline
Install dependencies using the installation script.
./install.sh
Run the transcription pipeline.
python main.py --config_file config/asr_transcription.yaml --input_audio <path_to_input_audio_file> --output_dir ./output/
During speaker-aware transcription, this repo uses pyannote/speaker-diarization-3.1 for diarization. Please request access to the model at https://huggingface.co/pyannote/speaker-diarization-3.1.
Create a .env file and set the HF_ACCESS_TOKEN environment variable. This token is required by the diarization models and can be obtained from HuggingFace.
HF_ACCESS_TOKEN=<your access token>
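The repo reads HF_ACCESS_TOKEN from the .env file; a minimal stdlib-only loader, roughly equivalent to what a dotenv library does (the repo's actual loading mechanism may differ), looks like:

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments ignored."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: values already present in the environment win
            os.environ.setdefault(key.strip(), value.strip())
```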
Run the speaker-aware transcription pipeline.
python main.py --config_file config/asr_speaker_aware.yaml --input_audio <path_to_input_audio_file> --output_dir ./output/logs --target_keywords rabbit,duchess,king
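The --target_keywords option drives keyword spotting over the generated transcription. A simplified sketch of case-insensitive whole-word counting (the repo's exact matching rules may differ):

```python
import re

def spot_keywords(text, keywords):
    """Count case-insensitive whole-word occurrences of each target keyword."""
    counts = {}
    for kw in keywords:
        counts[kw] = len(re.findall(rf"\b{re.escape(kw)}\b", text, re.IGNORECASE))
    return counts

print(spot_keywords("The Rabbit ran past the Duchess and the rabbit hid.",
                    ["rabbit", "duchess", "king"]))
# → {'rabbit': 2, 'duchess': 1, 'king': 0}
```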
On successful execution, the following outputs are generated inside the output/logs folder:
- raw_transcription.txt - Transcription of input audio
- aligned_transcription.txt - Transcription time aligned with input audio
- diarization_transcription.txt - Transcription separated by speaker, with speaker attribution
- segment_transcription.txt - Speaker utterance with timestamps
- analytics.txt - Computed associated analytics
- logs.txt - Information on configurations, execution and profiling.
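Analytics such as per-speaker talking time and speaker switches per conversation can be derived directly from the speaker utterance timestamps. A simplified sketch, assuming each segment is a hypothetical `(start_sec, end_sec, speaker)` tuple:

```python
def conversation_analytics(segments):
    """Compute per-speaker talk time and speaker switches from utterance segments."""
    talk_time = {}
    switches = 0
    prev = None
    for start, end, speaker in segments:
        talk_time[speaker] = talk_time.get(speaker, 0.0) + (end - start)
        if prev is not None and speaker != prev:
            switches += 1
        prev = speaker
    return talk_time, switches

segments = [(0, 4, "agent"), (4, 10, "customer"), (10, 12, "agent")]
talk, switches = conversation_analytics(segments)
print(talk, switches)
```

Ratios such as listen-to-speak follow directly by dividing the per-speaker talk times.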
Transcription supports two backends: HuggingFace and FasterWhisper.
Please refer to the links below for model (size) options and their respective keys.
- OpenAI - https://huggingface.co/openai
- Distil-Whisper - https://huggingface.co/distil-whisper
The default HuggingFace transcription topology is distil-whisper/distil-large-v2. To select a different topology, update asr_transcription.yaml or asr_speaker_aware.yaml accordingly.
Usage example:
transcription:
topology: distil-whisper/distil-large-v2 # model name key
compute_type: fp32
backend: HFWhisper
batch_size: 16 # Chunk batch size for long-form transcription
device: cpu
For model name keys, please refer to https://github.com/m-bain/whisperX
The default FasterWhisper transcription topology is large-v2. To select a different topology, update asr_transcription.yaml or asr_speaker_aware.yaml accordingly.
Usage example:
transcription:
topology: large-v2 # model name key
compute_type: fp32
backend: whisperx
batch_size: 16 # Chunk batch size for long-form transcription
device: cpu
For long-form transcription, the batch_size option should be set.
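To see what batch_size controls, the chunking of long-form audio can be pictured as follows. This is an illustrative sketch only; the 30-second chunk length is an assumption (Whisper's usual window), and the real pipeline operates on audio arrays rather than sample indices.

```python
def batch_chunks(num_samples, sample_rate=16000, chunk_sec=30, batch_size=16):
    """Split an audio stream of num_samples into fixed-length chunks, grouped into batches."""
    chunk = chunk_sec * sample_rate
    starts = list(range(0, num_samples, chunk))          # one start offset per chunk
    return [starts[i:i + batch_size] for i in range(0, len(starts), batch_size)]

# one hour of 16 kHz audio -> 120 chunks -> 8 batches (7 full batches of 16, last of 8)
batches = batch_chunks(3600 * 16000)
print(len(batches), len(batches[0]), len(batches[-1]))
```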
This repo uses WhisperX for transcription alignment. Please refer to WhisperX for more details.
The repo contains a Dockerfile for easy and quick deployment. The rest_api module contains the code needed to expose REST API endpoints. APIs are built using FastAPI, and models are served using Ray Serve.
sudo docker build -t intel-e2e-asr .
If building the image fails due to proxy settings, use the following command:
sudo docker build -t intel-e2e-asr --build-arg HTTP_PROXY=$http_proxy --build-arg HTTPS_PROXY=$http_proxy --build-arg NO_PROXY="$no_proxy" --build-arg http_proxy=$http_proxy --build-arg https_proxy=$http_proxy --build-arg no_proxy="$no_proxy" .
The .env file should look as follows:
HF_ACCESS_TOKEN=<your access token>
http_proxy=<proxy settings>
https_proxy=<proxy settings>
serve_config.yaml is the configuration file for serving models. num_cpus and other options in serve_config.yaml should be set according to the platform.
sudo docker run -d --name asr_container --env-file=.env -p 8888:8000 intel-e2e-asr
To interact with the APIs using Swagger UI, open http://localhost:8888/docs in your browser (port 8888 on the host maps to port 8000 in the container).
It is also possible to serve models with a bare-metal installation. Use the following command:
serve run serve_config.yaml
This repo contains a streamlit application to interact with the APIs. To run the streamlit app:
python -m streamlit run ui/app.py
By default, the streamlit application runs on http://localhost:8501/
Prerequisites:
- Examples using the BFloat16 data type require machines with the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 and Intel® Advanced Matrix Extensions (Intel® AMX) BF16 instruction sets.
- The following performance tuning examples use Intel® Extension for PyTorch* on Intel CPUs.
To speed up inference with the low-precision data type bfloat16, specify bf16 for the compute_type key in the configuration file.
transcription:
topology: distil-whisper/distil-large-v2
compute_type: bf16 # uses bfloat16
backend: HFWhisper # HFWhisper or whisperx
batch_size: 16 # Chunk batch size for long-form transcription
device: cpu
During inference, Auto Mixed Precision (AMP) is used with the bfloat16 data type. bfloat16 is also supported for the diarization and alignment models.
alignment:
compute_type: bf16
language: en
backend: whisperx_alignment
device: cpu
diarization:
pipeline_config: pyannote/speaker-diarization-3.1
use_auth_token: true
compute_type: bf16
device: cpu
backend: pyannote_diarization
Similarly, int8 can be specified for compute_type. This performs dynamic quantization using Intel® Extension for PyTorch*. Check the Supported Dtypes support matrix for details on supported data types.
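The arithmetic behind dynamic quantization can be illustrated with a simplified symmetric int8 scheme: a per-tensor scale is computed from the data at runtime, values are rounded to int8, and dequantized back. This is a conceptual sketch, not the Intel® Extension for PyTorch* implementation.

```python
def quantize_dequantize(values):
    """Symmetric int8 quantization: scale by max(|x|)/127, round, clamp, rescale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]   # int8 codes
    return [v * scale for v in q]                                  # dequantized floats

print(quantize_dequantize([0.5, -1.0, 0.25]))
```

The round trip introduces an error of at most about half the scale per value, which is the precision cost traded for faster int8 compute.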
For runtime environment settings, jemalloc, gperftools and intel-openmp are required. main.sh sets runtime environment variables for better performance. Usage is as follows:
./main.sh --config config/asr_speaker_aware.yaml --input <path_to_input_audio_file> --output_dir ./output/logs
Please note that, by default, it runs the application on all available cores. To bind execution to specific cores, use numactl as explained below.
Assuming cores 0-3 are on socket 0, the following command binds script execution to cores 0-3 and binds memory access to socket 0 local memory.
numactl --membind 0 -C 0-3 python main.py --config_file config/asr_speaker_aware.yaml --input_audio <path_to_input_audio_file> --output_dir ./output/logs
The environment variable OMP_NUM_THREADS sets the number of threads used for parallel regions. By default, it is set to the number of available physical cores. It can be used together with numactl, as in the following example. If cores 0-3 are on socket 0, the command runs on cores 0-3 with 4 OpenMP threads.
OMP_NUM_THREADS=4 numactl -C 0-3 --membind 0 python main.py --config_file config/asr_speaker_aware.yaml --input_audio <path_to_input_audio_file> --output_dir ./output/logs
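A hand-rolled equivalent of the runtime settings could look like the following; the KMP_* values come from the Intel® Extension for PyTorch* tuning guide, and the exact variables main.sh sets may differ.

```shell
# Typical IPEX CPU tuning environment (adjust values per platform)
export OMP_NUM_THREADS=4                          # one OpenMP thread per pinned core
export KMP_AFFINITY=granularity=fine,compact,1,0  # pin OpenMP threads to physical cores
export KMP_BLOCKTIME=1                            # let idle threads sleep quickly
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```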
For more information, see the Intel® Extension for PyTorch* tuning guide: https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html
| Recommended Hardware | Precision |
|---|---|
| Intel® 4th Gen Xeon® Scalable Performance processors | float32, bfloat16 |
Note: This is work in progress.
To run live transcription, use the following script. Please note that, as of now, only real-time simulation from a recorded file is supported.
python main_stream.py --config_file config/asr_live_stream.yaml --input_audio <path_to_input_audio_file> --output_dir ./logs/
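Real-time simulation replays a recorded file at wall-clock speed, feeding fixed-size chunks to the streaming transcriber as if they came from a microphone. A minimal sketch, where the 1-second chunk size is an assumption:

```python
import time

def simulate_stream(samples, sample_rate=16000, chunk_sec=1.0, realtime=True):
    """Yield chunk_sec-long slices of samples, sleeping to mimic live capture."""
    chunk = int(sample_rate * chunk_sec)
    for i in range(0, len(samples), chunk):
        if realtime:
            time.sleep(chunk_sec)  # pace playback at real time
        yield samples[i:i + chunk]

# 3 seconds of silence replayed instantly for demonstration
chunks = list(simulate_stream([0.0] * (3 * 16000), realtime=False))
print(len(chunks))
```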
This repo uses Whisper Streaming implementation for the live transcription.
| Model | FP32 | BF16 | INT8 (Dynamic Quantization) |
|---|---|---|---|
| HuggingFace Whisper (openai) | ✅ | ✅ | ✅ |
| HuggingFace Whisper (distil) | ✅ | ✅ | ✅ |
| whisperx (Faster-Whisper) | ✅ | ❌ | ✅ |
| whisperx alignment | ✅ | ✅ | ❌ |
| pyannote diarization | ✅ | ✅ | ❌ |
- Whisper_streaming https://github.com/ufal/whisper_streaming
- Whisperx https://github.com/m-bain/whisperX
- FasterWhisper https://github.com/SYSTRAN/faster-whisper
- pyannote-audio https://github.com/pyannote/pyannote-audio
- Whisper https://github.com/openai/whisper
- Distil-Whisper https://github.com/huggingface/distil-whisper
- Intel® Extension for PyTorch* https://github.com/intel/intel-extension-for-pytorch