# Speech-to-Text Robustness Analysis Using Whisper

## An Experimental Study on Audio Format and Noise Variations

This notebook explores how OpenAI's Whisper model performs under
different audio formats and distortion conditions.

We conduct controlled experiments using:

- A real human voice recording
- Multiple audio formats (WAV, MP3, low-bitrate MP3)
- Telephone-quality audio
- Background noise (moderate and heavy)
- Echo effects

Results are evaluated with:

- Word Error Rate (WER) analysis
- Inference time comparison

The objective is to analyze how signal distortion affects
automatic speech recognition performance.


## Objective

The goal of this experiment is to:

1. Record real human speech.
2. Transcribe it using Whisper.
3. Modify the audio under different conditions.
4. Compare transcription accuracy.
5. Measure Word Error Rate (WER).
6. Evaluate model robustness.

This study demonstrates how audio degradation impacts
speech recognition systems.


## Step 1: Recording Real Voice Input

In this section, we record a short two-sentence passage using the browser microphone.

Sentence spoken:

> "Machine learning is transforming healthcare.  
> Early diagnosis improves patient survival rates."

This serves as the baseline audio sample for all further experiments.


In [None]:
from IPython.display import Audio
from google.colab import output
from base64 import b64decode

RECORD_SECONDS = 10

print("Recording... Speak now.")

js = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = () => resolve(reader.result)
  reader.readAsDataURL(blob)
})
var record = async () => {
  const stream = await navigator.mediaDevices.getUserMedia({audio:true})
  const recorder = new MediaRecorder(stream)
  let chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
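  // Resolves once the recorder has delivered its final chunk.
  const stopped = new Promise(resolve => { recorder.onstop = resolve })
  recorder.start()
  await sleep(%d * 1000)
  recorder.stop()
  await stopped
  // Release the microphone.
  stream.getTracks().forEach(track => track.stop())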
  const blob = new Blob(chunks)
  const text = await b2text(blob)
  return text
}
record()
""" % RECORD_SECONDS

audio_data = output.eval_js(js)
binary = b64decode(audio_data.split(',')[1])

with open('my_voice.webm', 'wb') as f:
    f.write(binary)

print("Saved as my_voice.webm")


## Step 2: Converting Recorded Audio to WAV Format

The recorded file is saved as `.webm`.

Whisper decodes its input through FFmpeg and resamples it to 16 kHz mono,
so it can read `.webm` directly. We still convert the recording to `.wav`
with FFmpeg so that every later experiment starts from the same
uncompressed baseline file.
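
Once the conversion cell below has run, the resulting file can be checked with Python's standard `wave` module. The helper name `wav_info` is hypothetical, and the sketch assumes the file is plain PCM WAV, which is what FFmpeg normally writes for a `.wav` output.

```python
import wave

def wav_info(path):
    """Return basic header information for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wf.getnchannels(),
            "bits_per_sample": wf.getsampwidth() * 8,
            "duration_s": frames / rate,
        }
```

Calling `wav_info("my_voice.wav")` after the conversion shows the sample rate, channel count, and duration of the baseline file. The same call on `telephone.wav` later in the notebook should report 8000 Hz.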


In [None]:
!apt update && apt install -y ffmpeg


In [None]:
!ffmpeg -y -i my_voice.webm my_voice.wav


## Step 3: Installing and Testing Whisper (Base Model)

We install OpenAI Whisper and test the transcription
using the Base model.

The Base model is one of the smaller Whisper checkpoints
(about 74M parameters) and serves as an initial evaluation baseline.


In [None]:
!pip install -U openai-whisper


In [None]:
import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "my_voice.wav",
    language="en",
    temperature=0,
    fp16=False
)

print("BASE OUTPUT:")
print(result["text"])


## Step 4: Testing with a Larger Model (Medium)

We now load the Medium model, which has about 769M parameters
(roughly ten times as many as Base) and better language modeling capability.

This helps us observe how model size affects
transcription accuracy.


In [None]:
model = whisper.load_model("medium")

result = model.transcribe(
    "my_voice.wav",
    language="en",
    temperature=0,
    fp16=False
)

print("MEDIUM MODEL OUTPUT:")
print(result["text"])


## Step 5: Speed Variation (Fast Speech)

We artificially increase the playback speed of the audio
using the `atempo` filter.

This tests Whisper’s ability to handle rapid speech
and temporal distortion.


In [None]:
!ffmpeg -y -i my_voice.wav -filter:a "atempo=1.8" fast_voice.wav


In [None]:
result = model.transcribe(
    "fast_voice.wav",
    language="en",
    temperature=0,
    fp16=False
)

print("FAST VERSION OUTPUT:")
print(result["text"])


## Step 6: Audio Format Variations

We now test how different audio formats affect transcription:

- Standard MP3
- Low bitrate MP3 (high compression)

This helps analyze whether file compression impacts
speech recognition performance.
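
To get a feel for how aggressive the 32 kbps setting is, the expected file sizes can be estimated by simple arithmetic. This is a rough sketch: the function names are hypothetical, headers and container overhead are ignored, and the 48 kHz, mono, 16-bit values are assumptions for illustration rather than measured properties of the recording.

```python
def pcm_wav_bytes(sample_rate_hz, channels, bits_per_sample, seconds):
    """Approximate size of uncompressed PCM audio (header excluded)."""
    return int(sample_rate_hz * channels * (bits_per_sample // 8) * seconds)

def cbr_bytes(bitrate_bps, seconds):
    """Approximate size of a constant-bitrate stream (overhead excluded)."""
    return int(bitrate_bps * seconds / 8)

# Assumed values: 10 s (RECORD_SECONDS), 48 kHz, mono, 16-bit samples.
wav_size = pcm_wav_bytes(48_000, 1, 16, 10)
mp3_size = cbr_bytes(32_000, 10)
print(wav_size, mp3_size, wav_size / mp3_size)  # prints: 960000 40000 24.0
```

Under these assumptions the low-bitrate MP3 is about 24 times smaller than the uncompressed audio. Comparing the actual sizes with `os.path.getsize` gives the real ratio for a given recording.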


In [None]:
result = model.transcribe(
    "my_voice.wav",
    language="en",
    temperature=0,
    fp16=False
)

print("BASELINE:")
print(result["text"])


In [None]:
!ffmpeg -y -i my_voice.wav my_voice.mp3


In [None]:
result = model.transcribe(
    "my_voice.mp3",
    language="en",
    temperature=0,
    fp16=False
)

print("MP3 OUTPUT:")
print(result["text"])


In [None]:
!ffmpeg -y -i my_voice.wav -b:a 32k low_bitrate.mp3


In [None]:
result = model.transcribe(
    "low_bitrate.mp3",
    language="en",
    temperature=0,
    fp16=False
)

print("LOW BITRATE OUTPUT:")
print(result["text"])


## Step 7: Telephone-Quality Audio Simulation

We simulate a phone call environment by:

- Reducing the sampling rate to 8 kHz
- Applying bandpass filtering (300 Hz to 3400 Hz)

This mimics real-world telecommunication audio
and tests model robustness under limited frequency range.
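
The two settings are related: a signal sampled at 8 kHz can only represent frequencies up to half that rate (the Nyquist frequency), so the 300 Hz to 3400 Hz band fits with some margin. The small check below illustrates the arithmetic. Both function names are hypothetical.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at a given sampling rate."""
    return sample_rate_hz / 2

def band_fits(sample_rate_hz, low_hz, high_hz):
    """True if the band [low_hz, high_hz] lies at or below the Nyquist frequency."""
    return 0 <= low_hz < high_hz <= nyquist_hz(sample_rate_hz)

print(nyquist_hz(8000))             # 4000.0
print(band_fits(8000, 300, 3400))   # True
print(band_fits(8000, 300, 7000))   # False
```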


In [None]:
!ffmpeg -y -i my_voice.wav -ar 8000 -af "highpass=f=300, lowpass=f=3400" telephone.wav


In [None]:
result = model.transcribe(
    "telephone.wav",
    language="en",
    temperature=0,
    fp16=False
)

print("TELEPHONE OUTPUT:")
print(result["text"])


## Step 8: Background Noise Simulation

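We add white noise generated by FFmpeg's `anoisesrc` source at two levels:

1. Moderate noise (`amplitude=0.2`)
2. Heavy noise (`amplitude=0.6`)

This evaluates how a lower signal-to-noise ratio, where noise masks
speech cues, affects transcription accuracy.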
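
Noise levels are usually compared in decibels. The sketch below computes a signal-to-noise ratio (SNR) from RMS levels and the level difference between the two amplitude settings used in this step. It runs on synthetic stand-in signals (a sine tone and uniform random noise), not on the notebook's recording, and it assumes the `amplitude` setting scales the noise linearly. All function names are hypothetical.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, computed from RMS levels."""
    return 20 * math.log10(rms(signal) / rms(noise))

def level_change_db(amplitude_a, amplitude_b):
    """Level difference in dB between two linear amplitude settings."""
    return 20 * math.log10(amplitude_b / amplitude_a)

# Synthetic stand-ins: one second of a 440 Hz tone and uniform noise.
random.seed(0)
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
noise = [random.uniform(-0.2, 0.2) for _ in range(rate)]

print(f"SNR of tone vs. noise: {snr_db(tone, noise):.1f} dB")
print(f"0.2 -> 0.6 amplitude: {level_change_db(0.2, 0.6):+.2f} dB")  # +9.54 dB
```

Tripling the noise amplitude raises the noise level by about 9.5 dB, which lowers the SNR by the same amount if the speech level stays fixed.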


In [None]:
!ffmpeg -y -i my_voice.wav -filter_complex "anoisesrc=color=white:amplitude=0.2 [noise]; [0:a][noise] amix=inputs=2:duration=shortest" moderate_noise.wav


In [None]:
!ffmpeg -y -i my_voice.wav -filter_complex "anoisesrc=color=white:amplitude=0.6 [noise]; [0:a][noise] amix=inputs=2:duration=shortest" heavy_noise.wav


## Step 9: Echo Effect Simulation

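We apply FFmpeg's `aecho` filter, which adds a single delayed copy
of the signal (1000 ms delay, 0.3 decay). This is a distinct echo
rather than dense room reverberation.

The delayed copy overlaps with later speech and tests how well
the model separates overlapping words.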
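
To make the filter parameters concrete, here is a simplified model of a single-tap echo. The argument order follows the `aecho=in_gain:out_gain:delays:decays` filter string used in this step, but the function `single_echo` is a hypothetical illustration, not FFmpeg's implementation. It ignores clipping and keeps the output the same length as the input.

```python
def single_echo(samples, sample_rate_hz, in_gain=0.8, out_gain=0.9,
                delay_ms=1000, decay=0.3):
    """Simplified echo: out[n] = out_gain * (in_gain * x[n] + decay * x[n - d])."""
    d = int(sample_rate_hz * delay_ms / 1000)  # delay in samples
    out = []
    for n, x in enumerate(samples):
        delayed = samples[n - d] if n >= d else 0.0
        out.append(out_gain * (in_gain * x + decay * delayed))
    return out

# An impulse at 1 kHz sampling with a 2 ms delay: the echo lands 2 samples later.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
print([round(v, 2) for v in single_echo(impulse, 1000, delay_ms=2)])
# prints: [0.72, 0.0, 0.27, 0.0, 0.0]
```

With the notebook's 1000 ms delay, each word is repeated one second later at reduced level, on top of whatever is being said at that moment.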


In [None]:
!ffmpeg -y -i my_voice.wav -filter:a "aecho=0.8:0.9:1000:0.3" echo.wav


In [None]:
!pip install jiwer


## Step 10: Word Error Rate (WER) Calculation

To quantitatively evaluate performance,
we calculate Word Error Rate (WER).

WER measures the difference between:

- Reference sentence (ground truth)
- Transcribed output

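WER = (Substitutions + Insertions + Deletions) / Number of Words in the Reference

Lower WER indicates better accuracy. A WER of 0 means a perfect match,
and values above 1 are possible when the output contains many insertions.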
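
The formula can be sketched directly as a word-level edit distance. This is a minimal illustration of the metric, not the `jiwer` implementation used in this notebook. The names `to_words`, `word_error_rate`, `example_ref`, and `example_hyp` are hypothetical, and the hypothesis string is invented for the example.

```python
import re

def to_words(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = to_words(reference), to_words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

example_ref = "Early diagnosis improves patient survival rates."
example_hyp = "early diagnosis improve patient survival"
print(word_error_rate(example_ref, example_hyp))  # 2 errors / 6 words, about 0.333
```

Here "improves" becoming "improve" counts as one substitution and the missing "rates" as one deletion. Punctuation and case are removed first so that they are not counted as word errors.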


In [None]:
reference = "Machine learning is transforming healthcare. Early diagnosis improves patient survival rates."


In [None]:
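import re
import time
from jiwer import wer

def normalize_text(text):
    # Lowercase and drop punctuation so that "healthcare." matches "healthcare".
    return " ".join(re.sub(r"[^\w\s']", " ", text.lower()).split())

files = [
    "my_voice.wav",
    "my_voice.mp3",
    "low_bitrate.mp3",
    "telephone.wav",
    "moderate_noise.wav",
    "heavy_noise.wav",
    "echo.wav"
]

results = []

for f in files:
    # perf_counter is a monotonic clock intended for timing code.
    start = time.perf_counter()

    result = model.transcribe(
        f,
        language="en",
        temperature=0,
        fp16=False
    )

    end = time.perf_counter()

    transcription = result["text"].strip()
    error = wer(normalize_text(reference), normalize_text(transcription))
    inference_time = round(end - start, 2)

    results.append((f, transcription, error, inference_time))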

for r in results:
    print("\n==============================")
    print("File:", r[0])
    print("Transcription:", r[1])
    print("WER:", r[2])
    print("Inference Time (s):", r[3])


## Step 11: Performance Visualization

We organize the results into a structured DataFrame
and visualize WER across different audio conditions.

This provides a clear comparison of model robustness
under varying signal distortions.
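
Since every condition is derived from the same recording, it can also help to express each WER relative to the clean WAV baseline. The helper below is a hypothetical sketch that works on the `results` tuples built in Step 10, in the order (file, transcription, WER, inference time).

```python
def wer_relative_to_baseline(results, baseline_file="my_voice.wav"):
    """Map each file name to its WER minus the baseline file's WER."""
    wer_by_file = {name: error for name, _text, error, _seconds in results}
    baseline = wer_by_file[baseline_file]
    return {name: error - baseline for name, error in wer_by_file.items()}
```

Calling `wer_relative_to_baseline(results)` returns a dictionary in which the baseline maps to 0 and positive values indicate conditions that degraded the transcription.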


In [None]:
import pandas as pd

df = pd.DataFrame(results, columns=["File", "Transcription", "WER", "Inference Time (s)"])
df


In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.bar(df["File"], df["WER"])
plt.xticks(rotation=45)
plt.ylabel("Word Error Rate")
plt.title("WER Across Audio Conditions")
plt.show()


## Observations

1. The Medium model produced a more accurate transcript than the Base model. This comparison comes from reading the outputs of Steps 3 and 4, since the WER loop runs only the Medium model.
2. Simple format changes (WAV vs. MP3) had minimal impact.
3. Low-bitrate compression (32 kbps MP3) slightly affected clarity.
4. Telephone-quality audio reduced frequency information but remained understandable.
5. Moderate noise caused minor degradation.
6. Heavy noise significantly increased WER.
7. Echo introduced temporal confusion but did not cause transcription to fail completely.

Key Insight:
Audio signal distortion impacts transcription more than file format differences.


## Conclusion

This experiment demonstrates that:

- Whisper is highly robust to format variations.
- Model size matters: the Medium model transcribed the same recording more accurately than Base.
- Severe background noise significantly impacts performance.
- Band-limited audio (telephone quality) still performs reasonably well.
- Word Error Rate (WER) is an effective quantitative metric.

Overall, Whisper shows strong real-world robustness
but remains sensitive to extreme acoustic distortions.
