# Introduction to Faster Whisper

## Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Whisper, developed by OpenAI, excels in Automatic Speech Recognition (ASR) tasks by demonstrating high performance and strong generalization across datasets and domains without requiring fine-tuning. Its strength lies in training on an extensive dataset of 680,000 hours of multilingual and multitask supervised data sourced from the web. This dataset includes audio paired with existing transcriptions, such as videos with transcriptions provided by owners on platforms like YouTube. This approach minimizes the effort spent on labeling data and instead focuses on data cleaning, hence termed "Weak Supervision."

At the core of Whisper is a Transformer-based sequence-to-sequence model trained across various speech processing tasks: multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. This diversity makes Whisper a robust ASR model.

## CTranslate2 and Faster-Whisper: Optimizing Transformer Model Inference

CTranslate2 is a C++ and Python library designed for efficient inference with Transformer models. Engineered for high performance, it incorporates optimizations such as weight quantization, layer fusion, and batch reordering. These optimizations enhance speed and minimize memory usage on both CPUs and GPUs.

CTranslate2 supports a diverse array of model types, catering to various needs:

- **Encoder-decoder** models like Transformer base/big, M2M-100, and Whisper.
- **Decoder-only** models such as GPT-2, GPT-J, and Llama.
- **Encoder-only** models like BERT and XLM-RoBERTa.

Faster-Whisper is a reimplementation of the Whisper model on top of CTranslate2. It offers up to 4 times faster inference than openai/whisper with lower memory usage and the same accuracy. Further efficiency gains are achievable through 8-bit quantization on both CPU and GPU.

## What makes it fast?

- Optimized Execution: The framework achieves fast and efficient execution on both CPU and GPU through a variety of advanced optimizations. These include layer fusion, padding removal, batch reordering, in-place operations, and caching mechanisms, resulting in reduced resource consumption compared to general-purpose deep learning frameworks.

- Quantization and Precision Reduction: The framework supports model serialization and computation with weights of reduced precision, including 16-bit floating-point (FP16), 16-bit brain floating-point (BF16), 16-bit integer (INT16), 8-bit integer (INT8), and AWQ quantization (INT4). These techniques contribute to improved performance and reduced model size.
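
To make the idea of reduced-precision weights concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration under simplifying assumptions (a single scale for the whole tensor) and not CTranslate2's actual implementation; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor (an assumption made for simplicity).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored in 8 bits instead of 32, which is where a size reduction of up to four times comes from; the price is the small rounding error shown by `max_error`.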

## Benchmark

For reference, here are the time and memory required to transcribe 13 minutes of audio with different implementations.

- Large-v2 model on GPU


| Implementation   | Precision | Beam size | Time   | Max. GPU memory | Max. CPU memory |
|------------------|-----------|-----------|--------|-----------------|-----------------|
| openai/whisper   |    fp16    |        5   |  4m30s |      11325MB    |      9439MB     |
| faster-whisper   |    fp16    |        5   |    54s |       4755MB    |      3244MB     |
| faster-whisper   |    int8    |        5   |    59s |       3091MB    |      3117MB     |


- Small model on CPU
  
| Implementation   | Precision | Beam size | Time    | Max. memory |
|------------------|-----------|-----------|---------|-------------|
| openai/whisper   | fp32      | 5         | 10m31s  | 3101MB      |
| whisper.cpp      | fp32      | 5         | 17m42s  | 1581MB      |
| whisper.cpp      | fp16      | 5         | 12m39s  | 873MB       |
| faster-whisper   | fp32      | 5         | 2m44s   | 1675MB      |
| faster-whisper   | int8      | 5         | 2m04s   | 995MB       |
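
The speedups implied by these tables can be computed directly from the reported times. The sketch below is only arithmetic on the figures above; the helper names `to_seconds` and `speedups` are hypothetical, and real numbers will vary with hardware.

```python
import re


def to_seconds(duration):
    """Convert a string such as '4m30s' or '54s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", duration)
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


def speedups(times, baseline):
    """Return how many times faster each entry is than the baseline."""
    base = to_seconds(times[baseline])
    return {
        name: round(base / to_seconds(value), 2)
        for name, value in times.items()
        if name != baseline
    }


# Times copied from the benchmark tables above.
gpu_large_v2 = {
    "openai/whisper fp16": "4m30s",
    "faster-whisper fp16": "54s",
    "faster-whisper int8": "59s",
}
cpu_small = {
    "openai/whisper fp32": "10m31s",
    "whisper.cpp fp32": "17m42s",
    "whisper.cpp fp16": "12m39s",
    "faster-whisper fp32": "2m44s",
    "faster-whisper int8": "2m04s",
}

print(speedups(gpu_large_v2, "openai/whisper fp16"))
print(speedups(cpu_small, "openai/whisper fp32"))
```

A value above 1 means the implementation is faster than openai/whisper in that table, and a value below 1 means it is slower.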

- Support for Multiple CPU Architectures: The system is compatible with x86-64 and AArch64/ARM64 processors, incorporating multiple optimized backends such as Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate. This ensures broad compatibility and efficient execution across different platforms.

- Automatic CPU Detection and Code Dispatch: The binary is designed to include multiple backends (e.g., Intel MKL and oneDNN) and instruction set architectures (e.g., AVX, AVX2). The appropriate backend and instruction set are automatically selected at runtime based on the detected CPU configuration.

- Parallel and Asynchronous Execution: The framework supports parallel and asynchronous processing of multiple batches using multiple GPUs or CPU cores, enabling efficient handling of large-scale computations.

- Dynamic Memory Management: Memory usage is dynamically adjusted according to request size, facilitated by caching allocators for both CPU and GPU. This approach ensures that performance requirements are met while optimizing memory utilization.

- Reduced Disk Footprint: Quantization techniques can reduce model sizes on disk by up to four times with minimal loss in accuracy, contributing to more efficient storage and deployment.

- Simple Integration: The project features minimal dependencies and provides straightforward APIs in Python and C++ for ease of integration, covering a broad range of integration scenarios.

- Configurable and Interactive Decoding: The framework offers advanced decoding capabilities, including the ability to autocomplete partial sequences and return alternative outputs at specific points in the sequence, enhancing flexibility and accuracy in inference.

- Tensor Parallelism for Distributed Inference: For very large models, the framework supports tensor parallelism, allowing models to be distributed across multiple GPUs. The documentation provides detailed instructions for setting up the necessary environment for distributed inference.

# CTranslate2 Whisper

## Preparation

To install the packages needed for CTranslate2 Whisper, run:

In [28]:
!pip install git+https://github.com/openai/whisper.git --quiet
!pip install "transformers[torch]>=4.23" --quiet
!pip install --upgrade ctranslate2 --quiet

## Import libraries

In [45]:
import whisper
import numpy as np
import librosa
import torchaudio
from transformers import WhisperTokenizer, WhisperProcessor
import torch
import ctranslate2
from torchaudio.utils import download_asset
import IPython.display as ipd

## Prepare audio file

In [66]:

SAMPLE_WAV = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
wave_form, sampling_rate = torchaudio.load(SAMPLE_WAV)
wave_form = wave_form.numpy()
ipd.Audio(wave_form, rate=sampling_rate)

## Convert Whisper model to CTranslate2 Whisper

The Whisper model comes in various sizes, including whisper-tiny, whisper-base, whisper-small, and whisper-large. You can load the Whisper model from Hugging Face. In this guide, we will use the whisper-base model. First, we need to initialize the Whisper processor, which is responsible for encoding the input and decoding the output into a suitable format.

In [55]:
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
tokenizer = processor.tokenizer

In [68]:
processor.feature_extractor.nb_max_frames

3000
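
The value 3000 follows from the size of Whisper's input window. The short calculation below shows where it comes from, assuming the usual Whisper preprocessing settings (16 kHz audio, a 30-second window, and a hop of 160 samples between spectrogram frames); the constant names are chosen here for illustration.

```python
SAMPLING_RATE = 16_000  # samples per second expected by the processor
CHUNK_LENGTH = 30       # seconds of audio per input window
HOP_LENGTH = 160        # samples between consecutive spectrogram frames (assumed)

n_samples = SAMPLING_RATE * CHUNK_LENGTH        # samples in one window
nb_max_frames = n_samples // HOP_LENGTH         # spectrogram frames in one window
frame_ms = 1000 * HOP_LENGTH / SAMPLING_RATE    # duration covered by one frame

print(n_samples, nb_max_frames, frame_ms)  # 480000 3000 10.0
```

Each frame therefore covers 10 ms of audio, and shorter clips are padded up to the full 3000 frames.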

CTranslate2 provides a command that converts the Whisper model to the CTranslate2 format. The command below saves the converted model in the whisper-base directory, copies the specified configuration files, applies float16 quantization, and overwrites any existing files.

In [None]:
#The original model was converted with the following command:
!ct2-transformers-converter --model openai/whisper-base --output_dir whisper-base \
    --copy_files tokenizer.json preprocessor_config.json --quantization float16 --force

## CTranslate2 Whisper model

Load the CTranslate2 Whisper model using the code below and set it up to run on the CPU. Alternatively, it is possible to use a GPU by setting the device argument to "cuda".

In [56]:
translator = ctranslate2.models.Whisper("./whisper-base", device="cpu")



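We use the processor to extract a mel spectrogram from the audio. The processor is told that the audio is sampled at 16 kHz, so audio recorded at another rate should be resampled first. We then wrap the resulting NumPy array in a StorageView with ctranslate2.StorageView.from_array(), because StorageView is the array type that the CTranslate2 Whisper methods accept as input features. The model also provides ctranslate2.models.Whisper.encode() if you want to run the encoder on these features separately.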

In [70]:
inputs = processor(wave_form, return_tensors="np", sampling_rate=16000)
print(inputs.input_features.shape)
features = ctranslate2.StorageView.from_array(inputs.input_features)
features.shape

(1, 80, 3000)


[1, 80, 3000]

In [60]:

wave_form2, sampling_rate = torchaudio.load("./test_audio.wav")
wave_form2 = wave_form2.numpy()
inputs = processor(wave_form2, return_tensors="np", sampling_rate=16000)
features = ctranslate2.StorageView.from_array(inputs.input_features)
features.shape

[1, 80, 3000]

In [65]:
inputs.input_features.shape

(1, 80, 3000)

**Methods of CTranslate2 for the Whisper Model**

**1. Align**

Computes the alignment between the text tokens and the audio. This method is used to match parts of the text with corresponding segments of the audio.


In addition to the input audio features, we also need to provide other input parameters for this method, including:

- *start_sequence* is the initial set of tokens or starting point for the alignment process.
- *text_tokens* are the tokens of the text that the audio features should be aligned with.
- *num_frames* is the number of non-padding frames in the audio features.

Example Result

```python
[WhisperAlignmentResult(
    alignments=[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (5, 1), (5, 2), (5, 3),...],
    text_token_probs=[0.0, 0.0, 9.31674730964005e-05, 0.00015034245734568685, 9.820470586419106e-05, 1.2290715858398471e-05,...]
)]
```

Here’s a simplified breakdown:

- Alignments: A list of tuples where each tuple pairs a text token index with a time index in the audio. For instance, the tuple (5, 0) indicates that text token 5 is aligned with time index 0.
- Text Token Probabilities: A list with one probability per text token, indicating how confident the model is in that token given the audio. For example, a value of 0.00015034245734568685 means the model assigns the corresponding token a very low probability.
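
As a sketch of how such alignments might be post-processed, the snippet below turns the pairs into a start and end time per text token. It assumes each tuple is (text token index, time index) and that one time index corresponds to 20 ms of audio (50 per second); both are assumptions to verify against your CTranslate2 version, and `token_time_spans` is a hypothetical helper.

```python
FRAMES_PER_SECOND = 50  # assumption: one time index covers 20 ms of audio


def token_time_spans(alignments, frames_per_second=FRAMES_PER_SECOND):
    """Map each text token index to its (first, last) aligned time in seconds."""
    spans = {}
    for text_index, time_index in alignments:
        first, last = spans.get(text_index, (time_index, time_index))
        spans[text_index] = (min(first, time_index), max(last, time_index))
    return {
        token: (first / frames_per_second, last / frames_per_second)
        for token, (first, last) in spans.items()
    }


# The first pairs from the example result above.
alignments = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (5, 1), (5, 2), (5, 3)]
spans = token_time_spans(alignments)
print(spans[5])  # (0.0, 0.06)
```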

**2. Detect language**

This method takes the audio features as a StorageView and returns, for each item in the batch, a list of language tokens with their probabilities.


In [71]:
results = translator.detect_language(features)
language, probability = results[0][0]
print("Detected language %s with probability %f" % (language, probability))

Detected language <|en|> with probability 0.995563


Example result:
```python
[[('<|en|>', 0.4817270040512085), ('<|zh|>', 0.13297148048877716), ('<|ur|>', 0.05135079845786095), ('<|ko|>', 0.0501541793346405), ('<|hi|>', 0.04444431513547897), ('<|jw|>', 0.03632590174674988), ('<|nn|>', 0.025384925305843353), ('<|th|>', 0.021732434630393982), ('<|ar|>', 0.017658809199929237), ('<|fi|>', 0.017540380358695984), ('<|tl|>', 0.013759175315499306), ('<|mi|>', 0.01273771096020937), ('<|es|>', 0.010259411297738552)]]
```

The output of the detect_language method is a list of language tokens with their associated probabilities. Each entry consists of a language token such as <|en|> and a probability score representing the likelihood that the provided audio features are in that language.
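
A result like the one above can be reduced to a plain language code, with a guard for low-confidence detections. This is a minimal sketch: `pick_language` and its `min_probability` threshold are hypothetical and not part of CTranslate2.

```python
def pick_language(candidates, min_probability=0.5, fallback=None):
    """Return (language_code, probability) for the most likely language.

    If the best probability is below min_probability, the fallback code is
    returned instead so the caller can decide how to proceed.
    """
    token, probability = max(candidates, key=lambda item: item[1])
    code = token.strip("<|>")  # "<|en|>" -> "en"
    if probability < min_probability:
        return fallback, probability
    return code, probability


# The first entries of the example result above (one batch item).
candidates = [
    ("<|en|>", 0.4817270040512085),
    ("<|zh|>", 0.13297148048877716),
    ("<|ur|>", 0.05135079845786095),
]

print(pick_language(candidates))                       # below the default threshold
print(pick_language(candidates, min_probability=0.4))  # accepted as "en"
```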

**3. Generate**

The generate method creates text from audio features and prompts.

In [72]:
results = translator.detect_language(features)
language, probability = results[0][0]
prompt = processor.tokenizer.convert_tokens_to_ids(
    [
        "<|startoftranscript|>",
        language,
        "<|transcribe|>",
        "<|notimestamps|>",  # Remove this token to generate timestamps.
    ]
)

# Run generation for the 30-second window.
results = translator.generate(features, [prompt])
transcription = processor.decode(results[0].sequences_ids[0])
print(transcription)

 I had that curiosity beside me at this moment.


In [33]:
text_tokens = tokenizer(transcription, return_tensors="pt").input_ids.tolist()
text_tokens
# 50258: AddedToken("<|startoftranscript|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 50363: AddedToken("<|notimestamps|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
# 50257: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),

[[50258, 50363, 286, 632, 300, 18769, 15726, 385, 412, 341, 1623, 13, 50257]]

In [32]:
text = processor.batch_decode([[50257]], skip_special_tokens=True)[0]
text

''

In [31]:
tokenizer

WhisperTokenizer(name_or_path='openai/whisper-base', vocab_size=50258, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|endoftext|>', '<|startoftranscript|>', '<|en|>', '<|zh|>', '<|de|>', '<|es|>', '<|ru|>', '<|ko|>', '<|fr|>', '<|ja|>', '<|pt|>', '<|tr|>', '<|pl|>', '<|ca|>', '<|nl|>', '<|ar|>', '<|sv|>', '<|it|>', '<|id|>', '<|hi|>', '<|fi|>', '<|vi|>', '<|he|>', '<|uk|>', '<|el|>', '<|ms|>', '<|cs|>', '<|ro|>', '<|da|>', '<|hu|>', '<|ta|>', '<|no|>', '<|th|>', '<|ur|>', '<|hr|>', '<|bg|>', '<|lt|>', '<|la|>', '<|mi|>', '<|ml|>', '<|cy|>', '<|sk|>', '<|te|>', '<|fa|>', '<|lv|>', '<|bn|>', '<|sr|>', '<|az|>', '<|sl|>', '<|kn|>', '<|et|>', '<|mk|>', '<|br|>', '<|eu|>', '<|is|>', '<|hy|>', '<|ne|>', '<|mn|>', '<|bs|>', '<|kk|>', '<|sq|>', '<|sw|>', '<|gl|>', '<|mr|>', '<|pa|>', '<|si|

In [19]:

start_sequence = [50258]

In [74]:
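# Tokenize the transcription (the tokenizer adds the special tokens listed above)
text_tokens = tokenizer(transcription, return_tensors="pt").input_ids.tolist()
# Start the alignment from the <|startoftranscript|> token (ID 50258)
start_sequence = [tokenizer.convert_tokens_to_ids("<|startoftranscript|>")]
# Perform alignment; num_frames=3000 treats the whole 30-second window as non-padding
alignment_result = translator.align(
    features=features,
    start_sequence=start_sequence,
    text_tokens=text_tokens,
    num_frames=3000,
)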

print(alignment_result)

[WhisperAlignmentResult(alignments=[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (0, 10), (0, 11), (0, 12), (0, 13), (0, 14), (0, 15), (0, 16), (0, 17), (0, 18), (0, 19), (0, 20), (0, 21), (0, 22), (0, 23), (0, 24), (0, 25), (0, 26), (0, 27), (0, 28), (0, 29), (0, 30), (0, 31), (0, 32), (0, 33), (0, 34), (0, 35), (0, 36), (1, 36), (2, 36), (3, 36), (3, 37), (3, 38), (3, 39), (3, 40), (3, 41), (3, 42), (3, 43), (3, 44), (3, 45), (4, 45), (4, 46), (4, 47), (4, 48), (4, 49), (4, 50), (4, 51), (4, 52), (4, 53), (4, 54), (5, 54), (5, 55), (5, 56), (5, 57), (5, 58), (5, 59), (5, 60), (5, 61), (5, 62), (5, 63), (5, 64), (5, 65), (5, 66), (5, 67), (5, 68), (5, 69), (5, 70), (5, 71), (5, 72), (5, 73), (5, 74), (5, 75), (5, 76), (5, 77), (5, 78), (5, 79), (5, 80), (5, 81), (5, 82), (5, 83), (5, 84), (5, 85), (5, 86), (5, 87), (5, 88), (5, 89), (5, 90), (5, 91), (5, 92), (5, 93), (6, 93), (6, 94), (6, 95), (6, 96), (6, 97), (6, 98), (6, 99), (6, 100), (6, 101), 

In [91]:
transcription

' I had that curiosity beside me at this moment.'

In [15]:
len(text_tokens[0]), text_tokens

(13,
 [[50258, 50363, 286, 632, 300, 18769, 15726, 385, 412, 341, 1623, 13, 50257]])

In [90]:
processor.decode([18769])

' curiosity'

In [88]:
index = 286
token = tokenizer.convert_ids_to_tokens(index)
token

'ĠI'

In [None]:
metadata = torchaudio.info(SAMPLE_WAV)
print(metadata)

In [87]:
text_tokens = tokenizer(transcription, return_tensors="pt")
text_tokens

{'input_ids': tensor([[50258, 50363,   286,   632,   300, 18769, 15726,   385,   412,   341,
          1623,    13, 50257]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

# Faster Whisper transcription with CTranslate2

The previous section covered CTranslate2 and its Whisper methods. This section introduces Faster-Whisper, a reimplementation of OpenAI's Whisper model built on CTranslate2.

As noted earlier, Faster-Whisper achieves up to four times faster inference than openai/whisper at equivalent accuracy while requiring less memory. 8-bit quantization on both CPU and GPU can improve efficiency further.

## Installation

In [17]:
!pip install faster-whisper -U --quiet
!pip install transformers -U --quiet

In [92]:
# Alternative: install the latest development version from GitHub
!pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz" --quiet

**Sequential Inference faster-whisper**

First, we use the sequential inference method, where the input audio is split into segments that are transcribed one after another.

Here’s a summary of how the sequential inference method handles the transcription:
- Sequential Processing: This method segments the audio and processes each segment in sequence. This allows the model to generate text from each audio chunk while maintaining context from previous segments.
- Language Detection & VAD: It can automatically detect the language if not specified and filter out non-speech segments using voice activity detection.
- Final Output: This method returns a generator that yields transcribed segments, along with an object holding additional transcription details.

In [93]:
import time
from faster_whisper import WhisperModel

# Define the file path for the audio to be transcribed
filepath = SAMPLE_WAV

# Configuration parameters for transcription
word_timestamps = False  # Do not include word-level timestamps
vad_filter = True  # Apply Voice Activity Detection to remove non-speech segments
temperature = 0.0  # Use deterministic transcription
language = "en"  # Set language to English
model_size = "base"  # Use the "base" model
device, compute_type = "cpu", "float32"  # Set computation device and precision

# Initialize the Whisper model
model = WhisperModel(model_size, device=device, compute_type=compute_type)

# Transcribe the audio file with the specified settings
segments, transcription_info = model.transcribe(
    filepath,
    word_timestamps=word_timestamps,
    vad_filter=vad_filter,
    temperature=temperature,
    language=language,
)

# Output the transcription results and metadata
print(segments, transcription_info)


<generator object restore_speech_timestamps at 0x710d75a8b740> TranscriptionInfo(language='en', language_probability=1, duration=3.4, duration_after_vad=3.192, all_language_probs=None, transcription_options=TranscriptionOptions(beam_size=5, best_of=5, patience=1, length_penalty=1, repetition_penalty=1, no_repeat_ngram_size=0, log_prob_threshold=-1.0, log_prob_low_threshold=None, no_speech_threshold=0.6, compression_ratio_threshold=2.4, condition_on_previous_text=True, prompt_reset_on_temperature=0.5, temperatures=[0.0], initial_prompt=None, prefix=None, suppress_blank=True, suppress_tokens=(1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 359, 503, 522, 542, 873, 893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627, 3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647, 7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793, 14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675, 22520, 26130, 26161, 26435, 28279, 294

Output

segments

```python
<generator object restore_speech_timestamps at 0x7ec0dd52edc0>
```

transcription_info

```python
TranscriptionInfo(language='en', language_probability=1, duration=5.064, duration_after_vad=4.408, all_language_probs=None, transcription_options=TranscriptionOptions(beam_size=5, best_of=5, patience=1, length_penalty=1, repetition_penalty=1, no_repeat_ngram_size=0, log_prob_threshold=-1.0, no_speech_threshold=0.6, compression_ratio_threshold=2.4, condition_on_previous_text=True, prompt_reset_on_temperature=0.5, temperatures=[0.0], initial_prompt=None, prefix=None, suppress_blank=True, suppress_tokens=[-1], without_timestamps=False, max_initial_timestamp=1.0, word_timestamps=False, prepend_punctuations='"\'“¿([{-', append_punctuations='"\'.。,，!！?？:：”)]}、', max_new_tokens=None, clip_timestamps='0', hallucination_silence_threshold=None, hotwords=None), vad_options=VadOptions(threshold=0.5, min_speech_duration_ms=250, max_speech_duration_s=inf, min_silence_duration_ms=2000, speech_pad_ms=400))
```



Segments are transcription segments extracted from the audio file, each represented as an object. Each segment includes information about the audio portion it corresponds to, such as start time, end time, and transcribed text. By iterating through these segments, you can extract and process the entire transcription content.

In [94]:
for segment in segments:
    row = {
        "start": segment.start,
        "end": segment.end,
        "text": segment.text,
    }
    if word_timestamps:
        row["words"] = [
            {"start": word.start, "end": word.end, "word": word.word}
            for word in segment.words
        ]
    print(row)

{'start': 0.21, 'end': 3.21, 'text': ' I had that curiosity beside me at this moment.'}
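
As one example of processing the segments, the sketch below formats them as SubRip (SRT) subtitle text. It uses a hypothetical `Segment` stand-in with the same `start`, `end`, and `text` fields so that it runs without a model; the helpers `srt_timestamp` and `to_srt` are illustrative names, not part of faster-whisper.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the fields used from faster-whisper segments.
Segment = namedtuple("Segment", ["start", "end", "text"])


def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Build SRT text: a running index, a time range, and the segment text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}"
        blocks.append(f"{index}\n{time_range}\n{segment.text.strip()}\n")
    return "\n".join(blocks)


# Values taken from the segment printed above.
example_segments = [Segment(0.21, 3.21, " I had that curiosity beside me at this moment.")]
print(to_srt(example_segments))
```

Because the real `segments` object is a generator, it can be passed to `to_srt` only once; convert it to a list first if you need to reuse it.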


You can also extract word-level timestamps from an audio file, which give the start and end time of each word. This detailed information supports finer-grained analysis and processing of the transcription data.

In [95]:
segments, _ = model.transcribe(SAMPLE_WAV, word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

[0.00s -> 0.68s]  I
[0.68s -> 0.82s]  had
[0.82s -> 1.00s]  that
[1.00s -> 1.62s]  curiosity
[1.62s -> 2.12s]  beside
[2.12s -> 2.42s]  me
[2.42s -> 2.58s]  at
[2.58s -> 2.74s]  this
[2.74s -> 3.08s]  moment.


**Batched inference faster-whisper**

Parallel processing of audio chunks can significantly enhance inference performance compared to sequential processing methods, such as those used in sequential inference with Faster Whisper. This method involves:

- Breaking the audio into semantically meaningful chunks.
- Transcribing these chunks in parallel (as batches), utilizing a faster feature extraction process and VAD to skip non-speech portions.

This approach results in considerably faster transcription, especially for long audio files, without sacrificing accuracy.
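
The two steps above can be sketched in plain Python: merge detected speech spans into chunks of bounded length, then group the chunks into batches. This only illustrates the idea and is not faster-whisper's internal code; the span values, the 30-second limit, and the names `merge_spans` and `make_batches` are assumptions.

```python
def merge_spans(spans, max_chunk_s=30.0):
    """Merge consecutive (start, end) speech spans while a chunk stays within max_chunk_s."""
    chunks = []
    for start, end in spans:
        if chunks and end - chunks[-1][0] <= max_chunk_s:
            # Extend the current chunk to include this span.
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


def make_batches(chunks, batch_size):
    """Group chunks into lists of at most batch_size items."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


# Hypothetical speech spans (in seconds), such as a VAD step might produce.
speech_spans = [(0.0, 8.0), (9.5, 21.0), (24.0, 41.0), (43.0, 52.0), (70.0, 95.0)]
chunks = merge_spans(speech_spans)
batches = make_batches(chunks, batch_size=2)
print(chunks)
print(batches)
```

Each batch would then be transcribed in a single forward pass, which is where the speedup over one-segment-at-a-time decoding comes from.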



In [103]:
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cpu", compute_type="float32")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

In [105]:
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Define file path for the audio to be transcribed
filename = SAMPLE_WAV

# Configuration settings
word_timestamps = False  # Disable word-level timestamps
vad_filter = True  # Enable Voice Activity Detection to filter out non-speech segments
temperature = 0.0  # Set temperature to 0.0 for deterministic transcription
language = "en"  # Set language to English
model_size = "base"  # Use the "base" Whisper model
device, compute_type = "cpu", "float32"  # Set computation to use CPU and float32 precision

# Initialize the Whisper model
model = WhisperModel(model_size, device=device, compute_type=compute_type)

# Wrap the model in a BatchedInferencePipeline for batch processing
batched_model = BatchedInferencePipeline(model=model, use_vad_model=False)

# Perform transcription on the audio file using batch processing with batch_size of 16
segments, info = batched_model.transcribe(filename, batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

[0.00s -> 3.40s]  I had that curiosity beside me at this moment.


**Multi-segment language detection**

The detect_language_multi_segment method splits the audio into multiple segments and determines the overall language from the segments where detection is highly confident. Aggregating over segments makes the result more reliable, particularly when audio quality varies.

In [24]:
from faster_whisper import WhisperModel

language_info = model.detect_language_multi_segment(SAMPLE_WAV)
print(language_info)



{'language_code': 'en', 'language_confidence': 0.9965774416923523}
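
The aggregation idea can be sketched as follows: keep only confident per-segment predictions, choose the language that wins the most segments, and report its average confidence. This is an illustration of the principle, not the library's algorithm; the per-segment values, the threshold, and the name `aggregate_language` are hypothetical.

```python
from collections import defaultdict


def aggregate_language(segment_predictions, min_confidence=0.7):
    """Combine per-segment (language, confidence) pairs into one result."""
    votes = defaultdict(list)
    for language, confidence in segment_predictions:
        if confidence >= min_confidence:  # ignore low-confidence segments
            votes[language].append(confidence)
    if not votes:
        return None  # no segment was confident enough
    best = max(votes, key=lambda language: len(votes[language]))
    return {
        "language_code": best,
        "language_confidence": sum(votes[best]) / len(votes[best]),
    }


# Hypothetical per-segment detections for one audio file.
predictions = [("en", 0.98), ("en", 0.91), ("de", 0.55), ("en", 0.95), ("fr", 0.72)]
result = aggregate_language(predictions)
print(result)
```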


Finally, we time the Faster Whisper Sequential and Faster Whisper Batched methods on the same audio sample to compare their speed. The code for timing the Whisper model from OpenAI is included in the cell but commented out.

In [109]:
import time
from faster_whisper import WhisperModel, BatchedInferencePipeline
import whisper

# Define the file path for the audio to be transcribed
file_path = SAMPLE_WAV

# Configuration parameters for transcription
word_timestamps = False  # Do not include word-level timestamps
vad_filter = True  # Apply Voice Activity Detection to remove non-speech segments
temperature = 0.0  # Use deterministic transcription
language = "en"  # Set language to English
model_size = "base"  # Use the "base" model
device, compute_type = "cpu", "float32"  # Set computation device and precision

# Initialize the Whisper model
model = WhisperModel(model_size, device=device, compute_type=compute_type)

# Note: transcribe() returns a lazy generator, so this interval covers only
# the call itself; decoding happens when the segments are iterated.
time1 = time.time()
# Transcribe the audio file with the specified settings
segments, transcription_info = model.transcribe(
    file_path,
    word_timestamps=word_timestamps,
    vad_filter=vad_filter,
    temperature=temperature,
    language=language,
)
time2 = time.time()

# Wrap the model in a BatchedInferencePipeline for batch processing
batched_model = BatchedInferencePipeline(model=model, use_vad_model=False)

time3 = time.time()
# Perform transcription on the audio file using batch processing with batch_size of 16
segments, info = batched_model.transcribe(file_path, batch_size=16)
time4 = time.time()

# Load the Whisper model
# model = whisper.load_model(model_size, device=device)

# time5 = time.time()
# # Perform transcription
# result = model.transcribe(file_path, language=language)
# time6 = time.time()

# print("Time inference from Whisper model from OpenAI: ", time6 - time5)
print("Time inference from Faster Whisper Sequential: ", time2 - time1)
print("Time inference from Faster Whisper Batched: ", time4 - time3)


Time inference from Faster Whisper Sequential:  0.037093400955200195
Time inference from Faster Whisper Batched:  1.003199815750122
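
The timings above cover only the calls to transcribe. Since the segments come back as a generator, the decoding work runs when the segments are consumed, so an end-to-end measurement has to consume them before stopping the clock. Below is a minimal sketch of such a helper; `timed_transcription` and `fake_transcribe` are hypothetical names, and the stand-in generator replaces the real model so the snippet runs without faster-whisper installed.

```python
import time


def timed_transcription(make_segments):
    """Call make_segments(), consume the returned iterable, and time the whole run."""
    start = time.perf_counter()
    segments = list(make_segments())  # forces the lazy generator to do its work
    elapsed = time.perf_counter() - start
    return segments, elapsed


def fake_transcribe():
    """Hypothetical stand-in for a lazy transcription generator."""
    for text in ("first", "second"):
        time.sleep(0.05)  # simulate decoding work per segment
        yield text


# Creating the generator is nearly instant because nothing has been decoded yet.
start = time.perf_counter()
lazy_segments = fake_transcribe()
creation_time = time.perf_counter() - start

segments, full_time = timed_transcription(fake_transcribe)
print(f"creation: {creation_time:.4f}s, full run: {full_time:.4f}s")
```

With the real model, a call such as `timed_transcription(lambda: model.transcribe(file_path, language=language)[0])` would time the full sequential run, and the same pattern applies to the batched pipeline.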
