# Introduction to Faster Whisper 

## Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Whisper, developed by OpenAI, excels in Automatic Speech Recognition (ASR) tasks by demonstrating high performance and strong generalization across datasets and domains without requiring fine-tuning. Its strength lies in training on an extensive dataset of 680,000 hours of multilingual and multitask supervised data sourced from the web. This dataset includes audio paired with existing transcriptions, such as videos with transcriptions provided by owners on platforms like YouTube. This approach minimizes the effort spent on labeling data and instead focuses on data cleaning, hence termed "Weak Supervision."

At the core of Whisper is a Transformer-based sequence-to-sequence model trained across various speech processing tasks: multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. This diversity makes Whisper a robust ASR model.

## CTranslate2 and Faster-Whisper: Optimizing Transformer Model Inference

CTranslate2 is a C++ and Python library designed for efficient inference with Transformer models. Engineered for high performance, it incorporates optimizations such as weight quantization, layer fusion, and batch reordering. These optimizations enhance speed and minimize memory usage on both CPUs and GPUs.

CTranslate2 supports a diverse array of model types, catering to various needs:

- **Encoder-decoder** models like Transformer base/big, M2M-100, and Whisper.
- **Decoder-only** models such as GPT-2, GPT-J, and GPT-NeoX.
- **Encoder-only** models like BERT and XLM-RoBERTa.

Faster-Whisper runs the Whisper model on CTranslate2, offering up to 4 times faster inference with reduced memory usage compared to openai/whisper, while maintaining the same accuracy. Further efficiency gains are achievable through 8-bit quantization on both CPU and GPU.

## What makes it fast?

- Optimized Execution: The framework achieves fast and efficient execution on both CPU and GPU through a variety of advanced optimizations. These include layer fusion, padding removal, batch reordering, in-place operations, and caching mechanisms, resulting in reduced resource consumption compared to general-purpose deep learning frameworks.

- Quantization and Precision Reduction: The framework supports model serialization and computation with weights of reduced precision, including 16-bit floating-point (FP16), 16-bit brain floating-point (BF16), 16-bit integer (INT16), 8-bit integer (INT8), and AWQ quantization (INT4). These techniques contribute to improved performance and reduced model size.
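To make the idea of reduced-precision weights concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python. It only illustrates the general technique (scale the weights so the largest magnitude maps to 127, round, and divide by the scale to recover approximate values); it is not CTranslate2's actual implementation, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a list of floats to int8 values."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w * scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q / scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.04]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"max absolute error: {max_error:.4f}")
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step.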

## Benchmark

For reference, here's the time and memory usage that are required to transcribe 13 minutes of audio using different implementations.

- Large-v2 model on GPU


| Implementation   | Precision | Beam size | Time   | Max. GPU memory | Max. CPU memory |
|------------------|-----------|-----------|--------|-----------------|-----------------|
| openai/whisper   |    fp16    |        5   |  4m30s |      11325MB    |      9439MB     |
| faster-whisper   |    fp16    |        5   |    54s |       4755MB    |      3244MB     |
| faster-whisper   |    int8    |        5   |    59s |       3091MB    |      3117MB     |


- Small model on CPU
  
| Implementation   | Precision | Beam size | Time    | Max. memory |
|------------------|-----------|-----------|---------|-------------|
| openai/whisper   | fp32      | 5         | 10m31s  | 3101MB      |
| whisper.cpp      | fp32      | 5         | 17m42s  | 1581MB      |
| whisper.cpp      | fp16      | 5         | 12m39s  | 873MB       |
| faster-whisper   | fp32      | 5         | 2m44s   | 1675MB      |
| faster-whisper   | int8      | 5         | 2m04s   | 995MB       |
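The relative speedups implied by these tables can be computed directly from the reported times. The short sketch below does only that arithmetic; the helper names are hypothetical, and the timings are copied from the two tables above.

```python
import re


def parse_duration(text):
    """Convert a duration such as '4m30s' or '54s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2))
    return minutes * 60 + seconds


def speedups(times, baseline):
    """Return how many times faster each entry is than the baseline entry."""
    base = parse_duration(times[baseline])
    return {name: base / parse_duration(value) for name, value in times.items()}


# Large-v2 on GPU (first table)
gpu_times = {
    "openai/whisper fp16": "4m30s",
    "faster-whisper fp16": "54s",
    "faster-whisper int8": "59s",
}
# Small model on CPU (second table)
cpu_times = {
    "openai/whisper fp32": "10m31s",
    "faster-whisper fp32": "2m44s",
    "faster-whisper int8": "2m04s",
}

for name, factor in speedups(gpu_times, "openai/whisper fp16").items():
    print(f"GPU {name}: {factor:.2f}x")
for name, factor in speedups(cpu_times, "openai/whisper fp32").items():
    print(f"CPU {name}: {factor:.2f}x")
```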


- Support for Multiple CPU Architectures: The system is compatible with x86-64 and AArch64/ARM64 processors, incorporating multiple optimized backends such as Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate. This ensures broad compatibility and efficient execution across different platforms.

- Automatic CPU Detection and Code Dispatch: The binary is designed to include multiple backends (e.g., Intel MKL and oneDNN) and instruction set architectures (e.g., AVX, AVX2). The appropriate backend and instruction set are automatically selected at runtime based on the detected CPU configuration.

- Parallel and Asynchronous Execution: The framework supports parallel and asynchronous processing of multiple batches using multiple GPUs or CPU cores, enabling efficient handling of large-scale computations.

- Dynamic Memory Management: Memory usage is dynamically adjusted according to request size, facilitated by caching allocators for both CPU and GPU. This approach ensures that performance requirements are met while optimizing memory utilization.

- Reduced Disk Footprint: Quantization techniques can reduce model sizes on disk by up to four times with minimal loss in accuracy, contributing to more efficient storage and deployment.

- Simple Integration: The project features minimal dependencies and provides straightforward APIs in Python and C++ for ease of integration, covering a broad range of integration scenarios.

- Configurable and Interactive Decoding: The framework offers advanced decoding capabilities, including the ability to autocomplete partial sequences and return alternative outputs at specific points in the sequence, enhancing flexibility and accuracy in inference.

- Tensor Parallelism for Distributed Inference: For very large models, the framework supports tensor parallelism, allowing models to be distributed across multiple GPUs. The documentation provides detailed instructions for setting up the necessary environment for distributed inference.
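The "up to four times" reduction in disk footprint mentioned above follows from the number of bytes stored per weight. The sketch below is a rough estimate only: it ignores file overhead and any layers that are kept at higher precision, and it assumes roughly 74 million parameters for the base model (the figure listed in the openai/whisper README). The helper name is hypothetical.

```python
# Bytes needed to store a single weight at each precision.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def estimate_size_mb(num_parameters, precision):
    """Rough on-disk size in megabytes, counting weights only."""
    return num_parameters * BYTES_PER_WEIGHT[precision] / 1e6


base_parameters = 74_000_000  # assumption: approximate size of whisper-base

for precision in BYTES_PER_WEIGHT:
    size = estimate_size_mb(base_parameters, precision)
    print(f"{precision}: ~{size:.0f} MB")
```

The float32 estimate is close to the size of the `model.safetensors` file downloaded for whisper-base later in this document.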

# CTranslate2 Whisper

## Preparation

To install CTranslate2 along with the other packages used in this section, run:

In [2]:
!pip install git+https://github.com/openai/whisper.git --quiet
!pip install "transformers[torch]>=4.23" --quiet
!pip install --upgrade ctranslate2 --quiet

## Import libraries

In [16]:
import whisper
import numpy as np
import librosa
import torchaudio
from transformers import WhisperTokenizer, WhisperProcessor
import torch
import ctranslate2

Load the audio waveform and compute its mel spectrogram.

In [44]:
def load_audio(file_path, sr=16000):
    """
    Load audio file and convert it to the expected sample rate (16kHz).
    """
    audio, sample_rate = librosa.load(file_path, sr=sr)
    return audio, sample_rate

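def compute_mel_spectrogram(audio, sample_rate=16000):
    """
    Compute an 80-bin mel spectrogram from audio.

    Note: torchaudio's default settings do not reproduce Whisper's own
    features (log-mel, hop length 160, padded to 30 seconds). The
    WhisperProcessor used at the end of this document computes the
    features the model was trained on.
    """
    # Using torchaudio for mel spectrogram conversion
    waveform = torch.tensor(audio).unsqueeze(0)
    mel_spec_transform = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)
    mel_spectrogram = mel_spec_transform(waveform).squeeze(0).numpy()
    return mel_spectrogram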

# Load audio and compute mel spectrogram
audio_file_path = './test_audio.wav'
text = "he is currently completing a film titled halloween"
audio, sample_rate = load_audio(audio_file_path)
mel_spectrogram = compute_mel_spectrogram(audio, sample_rate)

## Convert Whisper model to CTranslate2 Whisper

Whisper checkpoints are published on the Hugging Face Hub in several versions, including whisper-tiny, whisper-base, whisper-small, whisper-medium, whisper-large, whisper-large-v2, and whisper-large-v3. The processor, which bundles the feature extractor and the tokenizer, can be loaded from the Hub using:

In [33]:
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
tokenizer = processor.tokenizer

CTranslate2 provides the ct2-transformers-converter command. The call below converts the Whisper model to the CTranslate2 format, saves it in the whisper-base directory, copies the specified configuration files, applies float16 quantization, and overwrites any existing files.

In [8]:
#The original model was converted with the following command:
!ct2-transformers-converter --model openai/whisper-base --output_dir whisper-base \
    --copy_files tokenizer.json preprocessor_config.json --quantization float16 --force

config.json: 100%|█████████████████████████| 1.98k/1.98k [00:00<00:00, 7.93MB/s]
model.safetensors: 100%|█████████████████████| 290M/290M [00:11<00:00, 24.5MB/s]
generation_config.json: 100%|██████████████| 3.81k/3.81k [00:00<00:00, 33.3MB/s]


## CTranslate2 Whisper


Load the CTranslate2 Whisper model using the code below and set it up to run on the CPU. To use a GPU instead, set the device argument to "cuda".

In [29]:
translator = ctranslate2.models.Whisper("./whisper-base", device="cpu")



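The ctranslate2.models.Whisper.encode() method runs the Whisper encoder on the mel spectrogram and returns the encoded features. Both its input and its output are ctranslate2.StorageView objects, so the NumPy array is first wrapped with StorageView.from_array().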

In [45]:
# Prepare the input for the Whisper model
mel_spectrogram = np.expand_dims(mel_spectrogram, axis=0)  # Add batch dimension
print(mel_spectrogram.shape)
num_frames = mel_spectrogram.shape[2]  # Number of non-padding frames

# Ensure the array has contiguous memory
mel_spectrogram = np.ascontiguousarray(mel_spectrogram)

# Wrap the NumPy array in a ctranslate2 StorageView and encode the audio features
encoded_features = translator.encode(ctranslate2.StorageView.from_array(mel_spectrogram))

(1, 80, 773)


In [46]:
#mel_spectrogram = np.expand_dims(mel_spectrogram, axis=0)
mel_spectrogram.shape

(1, 80, 773)

In [31]:
encoded_features

 -0.168749 0.356358 0.230836 ... -0.983337 -0.258547 0.563757
[cpu:0 float32 storage viewed as 1x387x512]
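The shapes printed above, 773 mel frames going in and 387 encoder positions coming out, can be checked with a little arithmetic. The sketch below assumes the encoder stem of the openai/whisper reference implementation: two 1-D convolutions with kernel size 3 and padding 1, the second with stride 2. The function names are hypothetical; the last dimension (512) is the hidden size of the base model.

```python
def conv1d_output_length(length, kernel_size, stride, padding):
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * padding - kernel_size) // stride + 1


def encoder_frames(mel_frames):
    """Number of encoder positions produced for a given number of mel frames."""
    # Assumption: conv1 keeps the length, conv2 halves it (stride 2).
    after_conv1 = conv1d_output_length(mel_frames, kernel_size=3, stride=1, padding=1)
    return conv1d_output_length(after_conv1, kernel_size=3, stride=2, padding=1)


print(encoder_frames(773))   # 387, matching the 1x387x512 output above
print(encoder_frames(3000))  # 1500, for a full 30-second window of 3000 frames
```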

**Methods of CTranslate2 for the Whisper Model**

**1. Align**

Computes the alignment between the text tokens and the audio. This method is used to match parts of the text with corresponding segments of the audio.


In [24]:
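# Tokenize the text without special tokens (one token list per batch item)
text_tokens = [tokenizer(text, add_special_tokens=False).input_ids]

# The start sequence is the decoder prompt that precedes the text
start_sequence = tokenizer.convert_tokens_to_ids(
    ["<|startoftranscript|>", "<|en|>", "<|transcribe|>"]
)

# Perform alignment (num_frames takes one value per batch item)
alignment_result = translator.align(
    features=encoded_features,
    start_sequence=start_sequence,
    text_tokens=text_tokens,
    num_frames=[num_frames]
)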

print(alignment_result)


In addition to the input audio features, we also need to provide other input parameters for this method, including:

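- *start_sequence* is the sequence of prompt tokens that precedes the text in the decoder input, i.e. the starting point for the alignment process.
- *text_tokens* are the tokens of the text that the audio features should be aligned with, given as one list of token IDs per batch item.
- *num_frames* is the number of non-padding frames in the audio features, given as one value per batch item.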

Example result:

```python
[WhisperAlignmentResult(
    alignments=[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (5, 1), (5, 2), (5, 3),...],
    text_token_probs=[0.0, 0.0, 9.31674730964005e-05, 0.00015034245734568685, 9.820470586419106e-05, 1.2290715858398471e-05,...]
)]
```

Here’s a simplified breakdown:

- Alignments: A list of tuples where each tuple pairs a text token index with a time index in the audio. For instance, the tuple (5, 0) indicates that text token 5 is aligned with time index 0, and the run (5, 0), (5, 1), (5, 2), ... shows that token 5 spans several consecutive time steps.
- Text Token Probabilities: A list with the probability the model assigns to each text token. Values close to 1 indicate a confident match, whereas a value such as 0.00015034245734568685 means the model considers the corresponding token very unlikely at that position.
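To turn an alignment list into something readable, the pairs can be grouped by text token. The sketch below treats each pair as (text token index, time index), the order described in the CTranslate2 documentation, and converts time indices to seconds assuming 0.02 s per index. That step size holds for standard Whisper features (30 seconds mapped to 1500 encoder positions) and is an assumption here; the helper name is hypothetical.

```python
SECONDS_PER_TIME_INDEX = 0.02  # assumption: 50 encoder positions per second


def token_time_spans(alignments):
    """Map each text token index to its (start, end) time in seconds."""
    index_ranges = {}
    for text_index, time_index in alignments:
        first, last = index_ranges.get(text_index, (time_index, time_index))
        index_ranges[text_index] = (min(first, time_index), max(last, time_index))
    return {
        text_index: (
            round(first * SECONDS_PER_TIME_INDEX, 2),
            round((last + 1) * SECONDS_PER_TIME_INDEX, 2),
        )
        for text_index, (first, last) in index_ranges.items()
    }


# First few pairs of the example result above.
alignments = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (5, 1), (5, 2), (5, 3)]
for text_index, (start, end) in token_time_spans(alignments).items():
    print(f"token {text_index}: {start:.2f}s -> {end:.2f}s")
```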

**2. Detect language**

Returns the probability of each supported language for the audio input.


In [36]:
detected_language = translator.detect_language(encoded_features)
print(detected_language)

[[('<|en|>', 0.5479803085327148), ('<|zh|>', 0.2321370244026184), ('<|ko|>', 0.03849892318248749), ('<|hi|>', 0.025631628930568695), ('<|ur|>', 0.019510192796587944), ('<|nn|>', 0.018716374412178993), ('<|th|>', 0.01430478598922491), ('<|ar|>', 0.01314567681401968), ('<|haw|>', 0.010905093513429165), ('<|es|>', 0.008537559770047665), ('<|fi|>', 0.008375878445804119), ('<|mi|>', 0.007494110614061356), ('<|jw|>', 0.007309376262128353), ('<|id|>', 0.007267948240041733), ('<|pl|>', 0.006108843255788088), ('<|tl|>', 0.005606784485280514), ('<|cy|>', 0.003739970503374934), ('<|ja|>', 0.0030162970069795847), ('<|la|>', 0.002473728731274605), ('<|ca|>', 0.0017497289227321744), ('<|km|>', 0.001358815236017108), ('<|vi|>', 0.00132815632969141), ('<|ms|>', 0.0011248827213421464), ('<|el|>', 0.0009848788613453507), ('<|sn|>', 0.000962412916123867), ('<|fa|>', 0.0009130383259616792), ('<|de|>', 0.0008172831730917096), ('<|hu|>', 0.0008067761082202196), ('<|bn|>', 0.0006981399492360651), ('<|fo|>', 

Example result:
```python
[[('<|en|>', 0.4817270040512085), ('<|zh|>', 0.13297148048877716), ('<|ur|>', 0.05135079845786095), ('<|ko|>', 0.0501541793346405), ('<|hi|>', 0.04444431513547897), ('<|jw|>', 0.03632590174674988), ('<|nn|>', 0.025384925305843353), ('<|th|>', 0.021732434630393982), ('<|ar|>', 0.017658809199929237), ('<|fi|>', 0.017540380358695984), ('<|tl|>', 0.013759175315499306), ('<|mi|>', 0.01273771096020937), ('<|es|>', 0.010259411297738552)]]
```

The output of the detect_language method is, for each batch item, a list of language tokens with their associated probabilities, ordered from most to least likely. Each entry consists of a language token and a probability score, representing the likelihood that this language is spoken in the provided audio features.
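In practice, only the most likely language is usually needed. The following sketch picks it from a result shaped like the example above and strips the `<|...|>` markers; the helper name is hypothetical and the sample values are the first entries of the example result.

```python
def top_language(language_probs):
    """Return (language_code, probability) for the most likely language."""
    token, probability = max(language_probs, key=lambda item: item[1])
    # "<|en|>" -> "en"
    return token.strip("<|>"), probability


# One batch item, as returned for a single audio input.
example_result = [[
    ("<|en|>", 0.4817270040512085),
    ("<|zh|>", 0.13297148048877716),
    ("<|ur|>", 0.05135079845786095),
]]

language, probability = top_language(example_result[0])
print(f"{language} ({probability:.2%})")
```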

**3. Generate**

The generate method creates text from audio features and prompts. It processes the data using different decoding strategies and parameters, allowing customization of output length, token penalties, and additional information such as scores and probabilities.

In [39]:
encoded_features.shape

[1, 387, 512]

In [48]:
mel_spectrogram.shape

(1, 80, 773)

In [47]:
generate_params = {
    "beam_size": 5,
    "patience": 5,  # Increased for longer sequences
    "length_penalty": 1.0,
    "repetition_penalty": 1.0,  # 1.0 applies no repetition penalty
    "no_repeat_ngram_size": 0,  # 0 disables n-gram blocking
    "max_length": 448,
    "return_scores": True,
    "return_no_speech_prob": True
}

prompts = tokenizer.convert_tokens_to_ids(
        [
            "<|startoftranscript|>",
            "<|en|>",
            "<|transcribe|>",
        ]
    )

# Generate text from audio. generate() expects a StorageView (the mel features
# or the output of encode()); passing the NumPy array directly raises a TypeError.
generation_results = translator.generate(encoded_features, [prompts], **generate_params)
print(generation_results)

Since the result is returned in the form of tokens, the decode function is used to convert the token sequence into standard text.

In [27]:
# Token IDs from the result
token_ids = generation_results[0].sequences_ids[0]

# Decode the token IDs to text
decoded_text = tokenizer.decode(token_ids, skip_special_tokens=True)

print(decoded_text)
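The no-speech probability requested with `return_no_speech_prob` is typically used to decide whether a window contains speech at all. The sketch below follows the rule used in the openai/whisper reference implementation: skip the window when the no-speech probability is high, unless the decoder was confident in its text. The thresholds 0.6 and -1.0 are the `no_speech_threshold` and `log_prob_threshold` values that appear in the TranscriptionOptions output later in this document. The function name and the `avg_logprob` input are hypothetical, and deriving an average log probability from the returned scores is not shown.

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, log_prob_threshold=-1.0):
    """Return True when a window should be treated as silence."""
    if no_speech_prob <= no_speech_threshold:
        return False
    # High no-speech probability: keep the text only if the decoder was confident.
    return avg_logprob <= log_prob_threshold


print(should_skip_segment(no_speech_prob=0.9, avg_logprob=-1.5))  # likely silence
print(should_skip_segment(no_speech_prob=0.9, avg_logprob=-0.2))  # confident text, keep
print(should_skip_segment(no_speech_prob=0.1, avg_logprob=-1.5))  # speech detected, keep
```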




# Faster Whisper transcription with CTranslate2

The previous section explored CTranslate2 and its Whisper methods. This section introduces Faster-Whisper, a Whisper implementation built on CTranslate2.

Faster-Whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, a high-performance inference engine for Transformer models. It runs up to four times faster than openai/whisper at the same accuracy while using less memory, and 8-bit quantization on CPU or GPU can improve efficiency further.

## Installation

In [None]:
!pip install faster-whisper -U
!pip install transformers -U

**Sequential Inference faster-whisper**

First, we use sequential inference, where the input audio is segmented and the segments are transcribed one after another.

Here’s a summary of how sequential inference handles the transcription:
- Sequential Processing: The audio is split into segments that are processed in order, so the model can use the text of previous segments as context.
- Language Detection & VAD: The language is detected automatically if not specified, and voice activity detection can filter out non-speech segments.
- Final Output: The method returns a generator that yields transcribed segments, together with additional transcription details.

In [4]:
import time
from faster_whisper import WhisperModel

# Define the file path for the audio to be transcribed
filepath = "/content/1518.wav"

# Configuration parameters for transcription
word_timestamps = False  # Do not include word-level timestamps
vad_filter = True  # Apply Voice Activity Detection to remove non-speech segments
temperature = 0.0  # Use deterministic transcription
language = "en"  # Set language to English
model_size = "large-v3"  # Use the "large-v3" model
device, compute_type = "cpu", "float32"  # Set computation device and precision

# Initialize the Whisper model
model = WhisperModel(model_size, device=device, compute_type=compute_type)

# Transcribe the audio file with the specified settings
segments, transcription_info = model.transcribe(
    filepath,
    word_timestamps=word_timestamps,
    vad_filter=vad_filter,
    temperature=temperature,
    language=language,
)

# Output the transcription results and metadata
print(segments, transcription_info)




Output

segments

```python
<generator object restore_speech_timestamps at 0x7ec0dd52edc0>
```

transcription_info

```python
TranscriptionInfo(language='en', language_probability=1, duration=5.064, duration_after_vad=4.408, all_language_probs=None, transcription_options=TranscriptionOptions(beam_size=5, best_of=5, patience=1, length_penalty=1, repetition_penalty=1, no_repeat_ngram_size=0, log_prob_threshold=-1.0, no_speech_threshold=0.6, compression_ratio_threshold=2.4, condition_on_previous_text=True, prompt_reset_on_temperature=0.5, temperatures=[0.0], initial_prompt=None, prefix=None, suppress_blank=True, suppress_tokens=[-1], without_timestamps=False, max_initial_timestamp=1.0, word_timestamps=False, prepend_punctuations='"\'“¿([{-', append_punctuations='"\'.。,，!！?？:：”)]}、', max_new_tokens=None, clip_timestamps='0', hallucination_silence_threshold=None, hotwords=None), vad_options=VadOptions(threshold=0.5, min_speech_duration_ms=250, max_speech_duration_s=inf, min_silence_duration_ms=2000, speech_pad_ms=400))
```



Segments are the transcription segments extracted from the audio file, each represented as an object with the start time, end time, and transcribed text of the audio portion it covers. Because segments is a generator, the transcription only runs as you iterate over it. By iterating through the segments, you can extract and process the entire transcription content.

In [2]:
for segment in segments:
    row = {
        "start": segment.start,
        "end": segment.end,
        "text": segment.text,
    }
    if word_timestamps:
        row["words"] = [
            {"start": word.start, "end": word.end, "word": word.word}
            for word in segment.words
        ]
    print(row)

{'start': 0.66, 'end': 3.9, 'text': ' He is currently completing a film titled Halloween.'}
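Because `segments` is a generator, calling `transcribe` returns almost immediately and the actual work happens during iteration. The stand-in below is not faster-whisper code; it is a hypothetical lazy function that mimics this behavior, to show where the time is actually spent and that a generator can be consumed only once.

```python
import time


def fake_transcribe(chunks, seconds_per_chunk=0.01):
    """Stand-in for a lazy transcriber: yields one result per chunk."""
    for chunk in chunks:
        time.sleep(seconds_per_chunk)  # pretend to decode the chunk
        yield f"text for {chunk}"


start = time.perf_counter()
segments = fake_transcribe(["chunk-0", "chunk-1", "chunk-2"])
call_time = time.perf_counter() - start  # nothing has been decoded yet

start = time.perf_counter()
results = list(segments)  # the work happens here
consume_time = time.perf_counter() - start

print(f"call: {call_time:.4f}s, iteration: {consume_time:.4f}s")
print(results)
print(list(segments))  # a second pass yields nothing: the generator is exhausted
```

Storing the results in a list, as above, is one way to reuse the segments after the first pass.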


Additionally, passing word_timestamps=True extracts word-level timestamps from the audio file, providing the start and end times of each word. This detailed information facilitates in-depth analysis and processing of the transcription data.

In [3]:
segments, _ = model.transcribe("/content/1518.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

[0.00s -> 1.06s]  He
[1.06s -> 1.18s]  is
[1.18s -> 1.50s]  currently
[1.50s -> 2.06s]  completing
[2.06s -> 2.34s]  a
[2.34s -> 2.54s]  film
[2.54s -> 2.96s]  titled
[2.96s -> 3.70s]  Halloween.
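One common use of word-level timestamps is subtitle generation. The sketch below formats (start, end, word) tuples as SRT blocks; it works on plain tuples rather than faster-whisper objects, the helper names are hypothetical, and the sample values are the first words of the output above.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def to_srt(words):
    """Build SRT text with one numbered block per word."""
    blocks = []
    for index, (start, end, word) in enumerate(words, start=1):
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{word.strip()}"
        )
    return "\n\n".join(blocks) + "\n"


words = [(0.00, 1.06, " He"), (1.06, 1.18, " is"), (1.18, 1.50, " currently")]
print(to_srt(words))
```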


**Batched inference faster-whisper**

Parallel processing of audio chunks can significantly enhance inference performance compared to sequential processing methods, such as those used in sequential inference with Faster Whisper. This method involves:

- Breaking the audio into semantically meaningful chunks.
- Transcribing these chunks in parallel (as batches), utilizing a faster feature extraction process and VAD to skip non-speech portions.

This approach results in considerably faster transcription, especially for long audio files, without sacrificing accuracy.
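The two steps above can be sketched in plain Python: merge speech spans into chunks of bounded length, then group the chunks into batches. This is only an illustration of the idea, not faster-whisper's implementation. The 30-second limit is assumed from Whisper's input window, the speech spans are made-up values standing in for VAD output, and the function names are hypothetical.

```python
def merge_into_chunks(speech_spans, max_chunk_seconds=30.0):
    """Greedily merge consecutive (start, end) spans into chunks of bounded length.

    A single span longer than the limit is kept as its own chunk.
    """
    chunks = []
    for start, end in speech_spans:
        if chunks and end - chunks[-1][0] <= max_chunk_seconds:
            chunks[-1] = (chunks[-1][0], end)  # extend the current chunk
        else:
            chunks.append((start, end))  # start a new chunk
    return chunks


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical speech spans in seconds, as a VAD step might produce them.
spans = [(0.5, 12.0), (13.0, 28.0), (29.5, 41.0), (45.0, 70.0), (71.0, 80.0)]
chunks = merge_into_chunks(spans)
batches = make_batches(chunks, batch_size=2)

print(chunks)
print(batches)
```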



In [1]:
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Define file path for the audio to be transcribed
filename = "/content/1518.wav"

# Configuration settings
word_timestamps = False  # Disable word-level timestamps
vad_filter = True  # Enable Voice Activity Detection to filter out non-speech segments
temperature = 0.0  # Set temperature to 0.0 for deterministic transcription
language = "en"  # Set language to English
model_size = "large-v3"  # Use the "large-v3" Whisper model
device, compute_type = "cpu", "float32"  # Set computation to use CPU and float32 precision

# Initialize the Whisper model
model = WhisperModel(model_size, device=device, compute_type=compute_type)

# Wrap the model in a BatchedInferencePipeline for batch processing
batched_model = BatchedInferencePipeline(model=model)

# Perform transcription on the audio file using batch processing with batch_size of 16
segments, info = batched_model.transcribe(filename, batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../usr/local/lib/python3.10/dist-packages/faster_whisper/assets/pyannote_vad_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.4.1+cu121. Bad things might happen unless you revert torch to 1.x.
[0.94s -> 4.43s]  He is currently completing a film titled Halloween.


**Multi-segment language detection**

The detect_language_multi_segment method splits the audio into multiple segments, detects the language of each, and aggregates the results from highly confident segments to identify the overall language. This segmented approach improves reliability, particularly when audio quality varies.

In [None]:
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="float32")
language_info = model.detect_language_multi_segment("audio.mp3")
print(language_info)
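The aggregation idea can be illustrated with a simple majority vote over confident segments. This is a sketch of the general principle only, not the logic faster-whisper uses internally; the function name, the 0.5 confidence threshold, and the per-segment predictions are all hypothetical.

```python
from collections import Counter


def vote_language(segment_predictions, min_probability=0.5):
    """Pick the most frequent language among confidently detected segments."""
    confident = [
        language
        for language, probability in segment_predictions
        if probability >= min_probability
    ]
    if not confident:
        return None  # no segment was confident enough
    return Counter(confident).most_common(1)[0][0]


# Hypothetical (language, probability) pairs, one per audio segment.
predictions = [("en", 0.97), ("en", 0.55), ("zh", 0.41), ("en", 0.88), ("ko", 0.62)]
print(vote_language(predictions))
```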

Finally, we will test all three inference methods, including the Whisper model from OpenAI, Faster Whisper Sequential, and Faster Whisper Batched, using the same audio sample. This will allow us to compare the speed and results of each method.

In [None]:
import time
from faster_whisper import WhisperModel, BatchedInferencePipeline
import whisper

# Define the file path for the audio to be transcribed
filepath = "/content/1518.wav"

# Configuration parameters for transcription
word_timestamps = False  # Do not include word-level timestamps
vad_filter = True  # Apply Voice Activity Detection to remove non-speech segments
temperature = 0.0  # Use deterministic transcription
language = "en"  # Set language to English
model_size = "large-v3"  # Use the "large-v3" model
device, compute_type = "cpu", "float32"  # Set computation device and precision

# Initialize the Faster-Whisper model
model = WhisperModel(model_size, device=device, compute_type=compute_type)

time1 = time.time()
# Transcribe the audio file with the specified settings
segments, transcription_info = model.transcribe(
    filepath,
    word_timestamps=word_timestamps,
    vad_filter=vad_filter,
    temperature=temperature,
    language=language,
)
# transcribe() returns a lazy generator: consume it so that the
# transcription itself is included in the measured time
sequential_segments = list(segments)
time2 = time.time()

# Wrap the model in a BatchedInferencePipeline for batch processing
batched_model = BatchedInferencePipeline(model=model)

time3 = time.time()
# Perform transcription on the audio file using batch processing with batch_size of 16
segments, info = batched_model.transcribe(filepath, batch_size=16)
batched_segments = list(segments)  # Consume the generator inside the timed region
time4 = time.time()

# Load the OpenAI Whisper model
openai_model = whisper.load_model(model_size, device=device)

time5 = time.time()
# Perform transcription
result = openai_model.transcribe(filepath, language=language)
time6 = time.time()

print("Time inference from Whisper model from OpenAI: ", time6 - time5)
print("Time inference from Faster Whisper Sequential: ", time2 - time1)
print("Time inference from Faster Whisper Batched: ", time4 - time3)


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../usr/local/lib/python3.10/dist-packages/faster_whisper/assets/pyannote_vad_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.4.1+cu121. Bad things might happen unless you revert torch to 1.x.
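Elapsed times are easier to compare across audio files when normalized by the audio duration, which gives the real-time factor (RTF): values below 1 mean the audio is transcribed faster than it plays. The sketch below shows the calculation only. The duration is the value reported in the TranscriptionInfo output earlier; the elapsed times are hypothetical placeholders, not measurements.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


audio_duration = 5.064  # seconds, from the TranscriptionInfo shown earlier

# Hypothetical elapsed times in seconds, for illustration only.
elapsed = {"method A": 2.532, "method B": 10.128}

for name, seconds in elapsed.items():
    print(f"{name}: RTF = {real_time_factor(seconds, audio_duration):.2f}")
```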


In [52]:
import ctranslate2
import librosa
import transformers

# Load and resample the audio file.
audio, _ = librosa.load("test_audio.wav", sr=16000, mono=True)

# Compute the features of the first 30 seconds of audio.
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-base")
inputs = processor(audio, return_tensors="np", sampling_rate=16000)
features = ctranslate2.StorageView.from_array(inputs.input_features)

In [58]:
features.shape

[1, 80, 3000]

In [59]:
encoded_features.shape

[1, 387, 512]

In [53]:
# Load the model on CPU.
model = ctranslate2.models.Whisper("whisper-base")



In [61]:
# Detect the language.
results = model.detect_language(features)
language, probability = results[0][0]
print("Detected language %s with probability %f" % (language, probability))

# Describe the task in the prompt.
prompt = processor.tokenizer.convert_tokens_to_ids(
    [
        "<|startoftranscript|>",
        language,
        "<|transcribe|>",
        "<|notimestamps|>",  # Remove this token to generate timestamps.
    ]
)

# Run generation for the 30-second window.
results = model.generate(features, [prompt])
transcription = processor.decode(results[0].sequences_ids[0])
print("Generate: ", transcription)

Detected language <|en|> with probability 0.996952
Generate:   Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition.
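When the `<|notimestamps|>` token is removed from the prompt, the model emits timestamp tokens around each piece of text. The sketch below parses such a decoded string into (start, end, text) tuples. It assumes the timestamps are rendered in the `<|0.00|>` form, and the sample string is made up for illustration rather than taken from a real run; the names are hypothetical.

```python
import re

# One segment: opening timestamp, text, closing timestamp.
SEGMENT_PATTERN = re.compile(r"<\|(\d+\.\d+)\|>(.*?)<\|(\d+\.\d+)\|>", re.DOTALL)


def parse_timestamped_text(decoded):
    """Return (start_seconds, end_seconds, text) tuples from timestamped text."""
    return [
        (float(start), float(end), text.strip())
        for start, text, end in SEGMENT_PATTERN.findall(decoded)
    ]


# Made-up example of decoded output that keeps timestamp tokens.
decoded = "<|0.00|> He is currently<|2.40|><|2.40|> completing a film<|3.90|>"
for start, end, text in parse_timestamped_text(decoded):
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")
```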
