# ASR Model Evaluation on Real-World Multilingual Audio

## Objective
This notebook evaluates Automatic Speech Recognition (ASR) performance using real-world multilingual audio.

Two models are compared:
- Whisper-1 (OpenAI)
- Wav2Vec2 (Facebook)

Evaluation Metrics:
- Word Error Rate (WER)
- Character Error Rate (CER)
- Real-Time Factor (RTF)
- Inverse RTF


# Audio Dataset Description

The input audio satisfies all assignment requirements:

- Duration: Greater than 8 minutes
- Speakers: Minimum 5 (human + synthetic)
- Accents: Indian, American, British
- Languages: English, Tamil, Hindi
- Noise: Fan noise and conversational variation
- Format: WAV, 16 kHz, mono

Audio segments include:
1. Indian English speech
2. American accent speech
3. British accent speech with noise
4. Tamil speech
5. Hindi speech
6. Synthetic English voice
7. Informal conversational speech


# 1. Google Drive Connection

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
import os

project_path = "/content/drive/MyDrive/ASR_Evaluation_Project"

os.makedirs(project_path, exist_ok=True)

print("Project folder created at:", project_path)


In [None]:
!pip install soundfile


In [None]:
%cd /content/drive/MyDrive/ASR_Evaluation_Project


In [None]:
!pwd


In [None]:
!git clone https://github.com/symblai/speech-recognition-evaluation.git


In [None]:
!ls


# 2. Audio Recording Setup

This section defines a custom audio recording function using Google Colab's browser microphone access.

The function:
- Requests microphone permission
- Records audio for a specified duration
- Saves the recording as `.webm`
- Converts it to `.wav` format using FFmpeg

This approach enables collection of real-world speech directly within Colab.
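
As a small illustration of the decoding step used by the recording function below: the browser hands the recording back as a base64 `data:` URL, and the bytes after the comma are the audio itself. The helper name `decode_data_url` and the sample payload are hypothetical, used only for this sketch.

```python
import base64

def decode_data_url(data_url: str) -> bytes:
    """Return the raw bytes carried by a base64 `data:` URL."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or ";base64" not in header:
        raise ValueError("expected a base64-encoded data URL")
    return base64.b64decode(payload)

# Round-trip check with a tiny stand-in payload (not real audio).
sample_url = "data:audio/webm;base64," + base64.b64encode(b"fake-webm-bytes").decode("ascii")
decoded = decode_data_url(sample_url)
print(decoded)
```

Validating the header first gives a clearer error than a failed `split(',')[1]` if the browser returns something unexpected, for example when microphone permission is denied.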


In [None]:
from google.colab import output
from base64 import b64decode
import time

def record_audio(filename, record_seconds=60):
    print("Get ready...")
    time.sleep(2)
    print("Recording will start in 3 seconds...")
    time.sleep(3)
    print("🎙️ Recording NOW. Speak clearly.")

    js = f"""
    async function recordAudio() {{
      const stream = await navigator.mediaDevices.getUserMedia({{audio: true}});
      const recorder = new MediaRecorder(stream);
      let chunks = [];

      recorder.ondataavailable = e => chunks.push(e.data);

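      recorder.start();
      await new Promise(resolve => setTimeout(resolve, {record_seconds} * 1000));

      // Register the stop handler before stopping so the event cannot be missed.
      const stopped = new Promise(resolve => recorder.onstop = resolve);
      recorder.stop();
      await stopped;
      // Release the microphone once recording has finished.
      stream.getTracks().forEach(track => track.stop());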

      const blob = new Blob(chunks);
      const reader = new FileReader();
      reader.readAsDataURL(blob);

      return new Promise(resolve => {{
        reader.onloadend = () => resolve(reader.result);
      }});
    }}

    recordAudio();
    """

    audio_data = output.eval_js(js)
    binary = b64decode(audio_data.split(',')[1])

    # Save as webm first
    webm_filename = filename.replace(".wav", ".webm")
    with open(webm_filename, "wb") as f:
        f.write(binary)

    print(f"Saved raw recording as {webm_filename}")

    # Convert to WAV using ffmpeg (-y overwrites an existing file instead of prompting)
    !ffmpeg -y -loglevel quiet -i "{webm_filename}" "{filename}"

    print(f"✅ Converted and saved as {filename}")


## Segment 1 – Indian English Accent

This segment captures natural Indian English speech in a relatively clean environment.

Purpose:
- Evaluate accent robustness
- Establish baseline English performance


In [None]:
record_audio("segment1.wav", record_seconds=120)


In [None]:
!mv segment1.webm /content/drive/MyDrive/ASR_Evaluation_Project/
!mv segment1.wav /content/drive/MyDrive/ASR_Evaluation_Project/


## Segment 2 – American Accent (Light Background Noise)

This segment simulates American-style pronunciation with slight environmental noise.

Purpose:
- Test accent variation handling
- Evaluate noise sensitivity


In [None]:
record_audio("segment2.wav", record_seconds=90)


## Segment 3 – British Accent (Noticeable Background Noise)

This segment includes British pronunciation and more noticeable background noise.

Purpose:
- Evaluate robustness to pronunciation shifts
- Analyze performance under environmental disturbances


In [None]:
record_audio("segment3.wav", record_seconds=90)


## Segment 4 – Tamil Speech

This segment includes natural Tamil speech recorded by the speaker.

Purpose:
- Evaluate multilingual capability
- Test non-English transcription performance


In [None]:
record_audio("segment4.wav", record_seconds=90)


## Segment 5 – Hindi Speech

This segment includes Hindi language content.

Purpose:
- Evaluate multilingual robustness
- Test model handling of Devanagari script


In [None]:
!pip install gtts


In [None]:
from gtts import gTTS

hindi_text = """
नमस्ते, यह रिकॉर्डिंग हिंदी भाषा में स्वचालित वाक् पहचान प्रणाली के मूल्यांकन के लिए बनाई गई है।
इस भाग का उद्देश्य बहुभाषी समर्थन की जांच करना है।
भारत जैसे देश में कई भाषाएँ और विभिन्न उच्चारण पाए जाते हैं।
एक मजबूत एएसआर प्रणाली को विभिन्न भाषाओं और शोरयुक्त वातावरण में भी सही पहचान करनी चाहिए।
इस ऑडियो में स्पष्ट उच्चारण और प्राकृतिक वाक्य संरचना शामिल है।
यह खंड मॉडल की बहुभाषी क्षमता का परीक्षण करने के लिए उपयोग किया जाएगा।
"""

tts = gTTS(text=hindi_text, lang='hi')
tts.save("segment5.mp3")

print("Hindi TTS segment saved as segment5.mp3")


In [None]:
!ffmpeg -y -i segment5.mp3 segment5.wav


## Segment 6 – Synthetic English Voice

This segment uses generated speech (Text-to-Speech).

Purpose:
- Compare model performance on artificial vs human speech
- Analyze pronunciation clarity effects


In [None]:
from gtts import gTTS

english_tts_text = """
This is a synthetic voice generated for automatic speech recognition evaluation.
Including generated speech allows us to compare human speech with artificial speech patterns.
Some ASR systems may perform better on synthetic audio because it has clearer pronunciation.
However, real-world human speech often contains natural pauses, emotion, and background noise.
This segment is included to test how the recognition model handles artificial speech.
"""

tts = gTTS(text=english_tts_text, lang='en')
tts.save("segment6.mp3")

print("English TTS segment saved as segment6.mp3")


## Segment 7 – Informal Conversational Speech

This segment includes natural conversational pacing with filler words and informal tone.

Purpose:
- Simulate real meeting environments
- Evaluate spontaneous speech handling


In [None]:
record_audio("segment7.wav", record_seconds=45)


In [None]:
!ffmpeg -y -i segment6.mp3 segment6.wav


# 3. Audio Concatenation and Final Formatting

All seven segments are combined into a single continuous audio file.

The final file:
- Exceeds 8 minutes in duration
- Is in WAV format
- Is resampled to 16 kHz
- Is converted to mono

This ensures compliance with evaluation requirements.
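
The concatenation command can also be assembled programmatically, which avoids hand-editing the filter graph when the number of segments changes. The sketch below only builds the argument list; the helper name `build_concat_command` is hypothetical, and it assumes each input holds a single audio stream (selected with the `[i:a]` label).

```python
def build_concat_command(segments, output="combined.wav"):
    """Build an ffmpeg argument list that concatenates audio-only inputs."""
    if not segments:
        raise ValueError("at least one segment is required")
    cmd = ["ffmpeg", "-y"]
    for path in segments:
        cmd += ["-i", path]
    # One "[i:a]" label per input selects that input's audio stream.
    labels = "".join(f"[{i}:a]" for i in range(len(segments)))
    graph = f"{labels}concat=n={len(segments)}:v=0:a=1[out]"
    cmd += ["-filter_complex", graph, "-map", "[out]", output]
    return cmd

segments = [f"segment{i}.wav" for i in range(1, 8)]
command = build_concat_command(segments)
print(" ".join(command))
# To execute: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` rather than a shell string sidesteps quoting problems if a segment path contains spaces.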


In [None]:
!ffmpeg -y \
-i segment1.wav \
-i segment2.wav \
-i segment3.wav \
-i segment4.wav \
-i segment5.wav \
-i segment6.wav \
-i segment7.wav \
-filter_complex "[0:0][1:0][2:0][3:0][4:0][5:0][6:0]concat=n=7:v=0:a=1[out]" \
-map "[out]" combined.wav


In [None]:
!ffmpeg -y -i combined.wav -ar 16000 -ac 1 input_audio.wav


# 4. Audio Verification

This step verifies:

- Sampling rate
- Duration
- Compliance with required 16 kHz mono format
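
These checks can also be written as explicit pass/fail tests using only the standard library. The following is a minimal sketch, assuming the file is uncompressed PCM WAV (which Python's `wave` module requires); the helper names and the synthetic demo tone are illustrative and not part of the notebook.

```python
import math
import os
import struct
import tempfile
import wave

def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration

def meets_requirements(path, min_minutes=8.0):
    """Check the 16 kHz, mono, minimum-duration requirements."""
    rate, channels, duration = describe_wav(path)
    return rate == 16000 and channels == 1 and duration >= min_minutes * 60

# Demo on a synthetic one-second 440 Hz tone (a stand-in, not input_audio.wav).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 16-bit samples
    wav.setframerate(16000)
    samples = (int(12000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(16000))
    wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))

print(describe_wav(demo_path))
print(meets_requirements(demo_path, min_minutes=0.01))
```

For the real file, `meets_requirements("input_audio.wav")` would return `True` only if all three conditions hold.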


In [None]:
import librosa

y, sr = librosa.load("input_audio.wav", sr=None)

print("Sample Rate:", sr)
print("Duration (minutes):", len(y)/sr/60)


# 5. Ground Truth (Reference Transcript)

A manual reference transcript was created to match exactly what was spoken.

Important:
- Repetitions preserved
- Multilingual text included
- Lowercase normalization applied
- Punctuation removed for fair WER calculation
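
For the comparison to be fair, the model outputs generally need the same normalization as the reference; otherwise differences in case and punctuation are scored as errors. The sketch below is one possible normalizer, not the method used to produce the reported numbers. It removes punctuation by Unicode category rather than with a `\w`-style regular expression, because such patterns can also strip the combining vowel signs that Tamil and Devanagari text depends on. The function name is hypothetical, and note that apostrophes are dropped ("don't" becomes "dont").

```python
import unicodedata

def normalize_transcript(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    pieces = []
    for ch in unicodedata.normalize("NFC", text).lower():
        category = unicodedata.category(ch)
        if category == "Pd":            # hyphens and dashes become spaces
            pieces.append(" ")
        elif category.startswith("P"):  # other punctuation is removed
            continue
        else:                           # letters, digits, combining marks, spaces
            pieces.append(ch)
    return " ".join("".join(pieces).split())

print(normalize_transcript("Hello, this is a Real-World test!"))
print(normalize_transcript("नमस्ते, यह हिंदी है।"))
```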


In [None]:
%%writefile reference.txt
hello this is a test recording for evaluating automatic speech recognition systems today i am speaking in my natural indian english accent artificial intelligence is transforming industries across the world from healthcare to finance we should learn models for improving efficiency however speech recognition systems still face challenges when dealing with different accents and voicing environments in india pronunciation may vary depending on regional background words like data schedule and advertisement may sound slightly different this recording includes natural passes and variations in speaking speed the purpose of this segment is to evaluate how well the asr handles indian accented english in a realistic environment the small passes the purpose of this segment is to evaluate natural passes and variations in speaking speed words like data schedule and advertisement may sound slightly different however speech recognition systems still face challenges when dealing with different accents and voicing environments so this is a test recording for evaluating automatic speech recognition systems today i am speaking in my natural indian english accent artificial intelligence is transforming industries across the world

hi everyone this is the second segment of the speech recognition evaluation recording in this part i am speaking in a more american style of english pronunciation automatic speech recognition systems have improved significantly in recent years ai models are now capable of handling different accents background noise and spontaneous speech patterns however performance can still vary depending on pronunciation speaking speed and environmental conditions for example when people speak quite quickly or when there are overlapping sounds in the background recognition accuracy may decrease this section is designed to evaluate how well the asr model adapts to accent variation combined with light background noise hi everyone this is the second segment of the speech recognition evaluation recording in this part i am speaking in a more american style of english pronunciation automatic speech recognition systems have improved significantly in recent years ai models are now capable of handling different accents background noise and spontaneous speech patterns

good afternoon this is the third segment of the evaluation recording in this section i am speaking in a british style accent speech recognition systems must be robust across different regions and pronunciation styles in the united kingdom certain words are pronounced differently compared to american english for example words such as advertisement schedule and laboratory may have distinct pronunciation patterns additionally background noise and environmental disturbance can make transcription more challenging this recording includes natural passes and realistic conversational pacing the objective is to test how well the asr system performs under accented speech with noticeable environmental noise good afternoon this is the third segment of the evaluation recording in this section i am speaking in a british style accent speech recognition systems must be robust across different regions and pronunciation styles in the united kingdom certain words are pronounced differently compared to american english for example words such as advertisement schedule and laboratory may have distinct pronunciation patterns additionally background noise and environmental disturbance can make transcription more challenging this recording includes natural passes and realistic conversational pacing the objective is to test how well the asr system performs under accented speech with noticeable environmental noise good afternoon this is

இன்று நான் தமிழ் மொழியில் பேசுகிறேன் இந்த பதிவு தானியங்கி குரல் அடையாள அமைப்புகளை மதிப்பீடு செய்வதற்காக உருவாக்கப்படுகிறது பல மொழிகளை சரியாக அடையாளம் காணும் திறன் மிகவும் முக்கியமானது இந்தியா போன்ற நாடுகளில் பல மொழிகள் மற்றும் பல்வேறு உச்சரிப்புகள் காணப்படுகின்றன தமிழ் மொழி ஒரு பழமையான மற்றும் செழுமையான மொழியாகும் உலகம் முழுவதும் கோடிக்கணக்கான மக்கள் தமிழ் பேசுகின்றனர் குரல் அடையாள அமைப்புகள் வெவ்வேறு மொழிகளில் உள்ள ஒலிவடிவங்களை புரிந்து கொள்ள வேண்டும் குறிப்பாக உயிரெழுத்துகள் மற்றும் மெய்யெழுத்துகளின் வேறுபாடு தெளிவாக அடையாளம் காணப்பட வேண்டும் பின்னணி சத்தம் பேசும் வேக மாற்றம் மற்றும் வாக்கிய இடைவெளிகள் ஆகியவை குரல் அடையாள அமைப்பின் செயல்திறனை பாதிக்கலாம் உதாரணமாக ஒருவர் மெதுவாக பேசும்போது மற்றும் வேகமாக பேசும்போது ஒலியின் தன்மை மாறும் சில நேரங்களில் மக்கள் உரையாடும்போது இடையில் சிறிய இடைவெளிகள் விடுவார்கள் அது இயல்பான பேச்சின் ஒரு பகுதியாகும் இந்த பகுதி பல மொழி ஆதரவை சோதிப்பதற்காக சேர்க்கப்பட்டுள்ளது இந்த மாதிரி பதிவு உண்மையான சூழ்நிலையில் பிரதிபலிக்கிறது

नमस्ते यह रिकॉर्डिंग हिंदी भाषा में स्वचालित वाक् पहचान प्रणाली के मूल्यांकन के लिए बनाई गई है इस भाग का उद्देश्य बहुभाषी समर्थन की जांच करना है भारत जैसे देश में कई भाषाएं और विभिन्न उच्चारण पाए जाते हैं एक मजबूत एएसआर प्रणाली को विभिन्न भाषाओं और शोरयुक्त वातावरण में भी सही पहचान करनी चाहिए इस ऑडियो में स्पष्ट उच्चारण और प्राकृतिक वाक्य संरचना शामिल है यह खंड मॉडल की बहुभाषी क्षमता का परीक्षण करने के लिए उपयोग किया जाएगा

this is a synthetic voice generated for automatic speech recognition evaluation including generated speech allows us to compare human speech with artificial speech patterns some asr systems may perform better on synthetic audio because it has clearer pronunciation however real world human speech often contains natural pauses emotion and background noise this segment is included to test how the recognition model handles artificial speech

ok so this is an additional short segment added to ensure the total audio duration exceeds eight minutes in real world scenarios people often speak informally and sometimes change topics appropriately there may also be slight hesitations filler words like hmm or you know and catch your pace this helps simulate a more realistic meeting or conversational environment the purpose of this short addition is to satisfy the minimum duration requirement while maintaining natural speech characteristics


In [None]:
!ls -lh reference.txt


In [None]:
!pip install openai


In [None]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
OPENAI_BASE_URL = userdata.get("OPENAI_BASE_URL")
from openai import OpenAI

client = OpenAI(
    api_key=OPENAI_API_KEY,
    base_url=OPENAI_BASE_URL
)


# 6. Whisper-1 Model Transcription

Whisper-1 is a transformer-based multilingual ASR model trained on diverse large-scale datasets.

Key features:
- Multilingual capability
- Robustness to accent variation
- Noise tolerance
- Fast inference speed

This section transcribes the complete audio file using Whisper-1.



In [None]:
import time

start_time = time.time()

with open("input_audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

end_time = time.time()

whisper_text = transcript.text

# Write as UTF-8 so Tamil and Hindi text is stored consistently.
with open("whisper_prediction.txt", "w", encoding="utf-8") as f:
    f.write(whisper_text)

whisper_processing_time = end_time - start_time

print("Whisper transcription saved.")
print("Processing time (seconds):", whisper_processing_time)


In [None]:
!pip install transformers torchaudio jiwer


# 7. Wav2Vec2 Model Transcription

The second ASR system evaluated is `facebook/wav2vec2-base-960h`.

Model characteristics:
- CTC-based architecture
- Pretrained and fine-tuned on 960 hours of English speech (LibriSpeech)
- Not multilingual: its output vocabulary covers only English letters
- Sensitive to noise and accent variations

This model is used to compare multilingual robustness and efficiency.
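
To make the CTC step concrete: the model emits one label per audio frame, and greedy decoding takes the most likely label at each frame, collapses consecutive repeats, and removes the special blank label. The sketch below illustrates that rule on a toy example. The vocabulary, blank index, and function name are hypothetical; the real model's vocabulary and the library's decoder differ in detail.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame labels, then drop CTC blanks."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)

# Hypothetical toy vocabulary; the real model's vocabulary differs.
toy_vocab = {0: "<blank>", 1: "H", 2: "E", 3: "L", 4: "O"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
decoded = ctc_greedy_decode(frames, toy_vocab)
print(decoded)
```

The blank between the two runs of `3` is what lets the decoder keep a genuine double letter instead of merging it into one.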


In [None]:
import time
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sr = librosa.load("input_audio.wav", sr=16000)

start_time = time.time()

inputs = processor(speech, return_tensors="pt", sampling_rate=16000)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
wav2vec_text = processor.decode(predicted_ids[0])

end_time = time.time()

with open("other_model_prediction.txt", "w", encoding="utf-8") as f:
    f.write(wav2vec_text)

wav2vec_processing_time = end_time - start_time

print("Wav2Vec2 transcription saved.")
print("Processing time (seconds):", wav2vec_processing_time)


In [None]:
from jiwer import wer, cer

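# Read all transcripts as UTF-8 because the reference contains Tamil and Hindi text.
with open("reference.txt", encoding="utf-8") as f:
    reference = f.read()

with open("whisper_prediction.txt", encoding="utf-8") as f:
    whisper_pred = f.read()

with open("other_model_prediction.txt", encoding="utf-8") as f:
    other_pred = f.read()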


# 8. Evaluation Metrics

## Word Error Rate (WER)
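WER = (Substitutions + Insertions + Deletions) / Number of Words in the Reference

Measures word-level transcription errors; lower is better. Because insertions are counted, WER can exceed 1 when the hypothesis contains many extra or mismatched words.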

## Character Error Rate (CER)
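CER = (Substitutions + Insertions + Deletions) / Number of Characters in the Reference

Measures character-level transcription errors, computed the same way as WER but over characters instead of words.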
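
Both metrics reduce to an edit distance divided by the reference length. The following is a minimal pure-Python sketch of that idea, intended as an illustration rather than a replacement for `jiwer`, whose tokenization and defaults may differ in detail. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

reference = "speech recognition systems"
print(word_error_rate(reference, "speech recognition system"))
print(char_error_rate(reference, "speech recognition system"))
print(word_error_rate("hello world", "oh hello there big world"))
```

In the first pair, a single missing letter costs one full word out of three at the word level but only one character out of 26 at the character level. The last call yields a WER above 1 because the hypothesis adds three words to a two-word reference.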

## Real-Time Factor (RTF)
RTF = Processing Time / Audio Duration

- RTF < 1 → Faster than real-time
- RTF > 1 → Slower than real-time

## Inverse RTF
Inverse RTF = 1 / RTF = Audio Duration / Processing Time

Indicates how many times faster than real-time the model operates.
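
A short worked example of both speed metrics, using made-up timings purely for illustration (these are not measurements from this notebook); the function names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

def inverse_rtf(processing_seconds, audio_seconds):
    """Inverse RTF = audio duration / processing time."""
    return 1.0 / real_time_factor(processing_seconds, audio_seconds)

# Illustrative numbers only, not measurements from this notebook.
audio_seconds = 600.0                           # a 10-minute clip
fast = real_time_factor(30.0, audio_seconds)    # processed in 30 s
slow = real_time_factor(900.0, audio_seconds)   # processed in 15 min
print(fast, inverse_rtf(30.0, audio_seconds))
print(slow, inverse_rtf(900.0, audio_seconds))
```

The first case is faster than real time (RTF below 1), the second slower (RTF above 1).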


In [None]:
whisper_wer = wer(reference, whisper_pred)
whisper_cer = cer(reference, whisper_pred)

other_wer = wer(reference, other_pred)
other_cer = cer(reference, other_pred)

print("Whisper WER:", whisper_wer)
print("Whisper CER:", whisper_cer)

print("Wav2Vec2 WER:", other_wer)
print("Wav2Vec2 CER:", other_cer)


In [None]:
import librosa

y, sr = librosa.load("input_audio.wav", sr=None)
audio_duration = len(y) / sr

whisper_rtf = whisper_processing_time / audio_duration
wav2vec_rtf = wav2vec_processing_time / audio_duration

whisper_inverse_rtf = 1 / whisper_rtf
wav2vec_inverse_rtf = 1 / wav2vec_rtf

print("Whisper RTF:", whisper_rtf)
print("Whisper Inverse RTF:", whisper_inverse_rtf)

print("Wav2Vec2 RTF:", wav2vec_rtf)
print("Wav2Vec2 Inverse RTF:", wav2vec_inverse_rtf)


In [None]:
with open("results.txt", "w") as f:
    f.write("WHISPER RESULTS\n")
    f.write(f"WER: {whisper_wer}\n")
    f.write(f"CER: {whisper_cer}\n")
    f.write(f"RTF: {whisper_rtf}\n")
    f.write(f"Inverse RTF: {whisper_inverse_rtf}\n\n")

    f.write("WAV2VEC2 RESULTS\n")
    f.write(f"WER: {other_wer}\n")
    f.write(f"CER: {other_cer}\n")
    f.write(f"RTF: {wav2vec_rtf}\n")
    f.write(f"Inverse RTF: {wav2vec_inverse_rtf}\n")

print("results.txt created successfully.")


# 9. Model Comparison

| Metric | Whisper | Wav2Vec2 |
|--------|----------|------------|
| WER | 0.49 | 1.22 |
| CER | 0.27 | 0.87 |
| RTF | 0.031 | 1.43 |

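Observations:

- Whisper clearly outperformed Wav2Vec2 on this multilingual recording (WER 0.49 vs. 1.22, CER 0.27 vs. 0.87).
- Wav2Vec2 struggled with the Tamil and Hindi segments, as expected for an English-only model.
- Whisper was more robust to accent variation and background noise.
- Whisper ran faster than real time (RTF 0.031), while Wav2Vec2 ran slower than real time (RTF 1.43).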


# 10. Results Summary

## Whisper-1
- WER: 0.49
- CER: 0.27
- RTF: 0.031
- Roughly 32× faster than real-time (1 / 0.031 ≈ 32)

## Wav2Vec2
- WER: 1.22
- CER: 0.87
- RTF: 1.43
- Slower than real-time


# 11. Final Observations

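1. In this evaluation, the multilingual transformer-based model (Whisper-1) outperformed the English-only CTC-based model (Wav2Vec2) under real-world conditions.

2. Whisper handled the following more effectively than Wav2Vec2:
   - Accent variations
   - Background noise
   - Multilingual speech

3. Wav2Vec2 failed on the non-English segments: it was trained on English only, so it cannot produce Tamil or Devanagari text.

4. Whisper achieved real-time processing capability (RTF < 1). Its timing covers a hosted API call, whereas Wav2Vec2 ran locally in the notebook, so the two RTF values reflect different hardware.

5. CER being lower than WER for both models suggests that many word-level errors were partial mismatches (a few wrong characters within a word) rather than entirely wrong words.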

## Conclusion

Whisper-1 is better suited than Wav2Vec2 to real-world multilingual ASR applications.
