<a href="https://colab.research.google.com/github/iupui-soic/openemr-ai/blob/main/wlargev3turbo_k_ds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper large-v3-turbo (word error rate)
This notebook evaluates `openai/whisper-large-v3-turbo` on speech transcription: it transcribes a set of audio files and computes the word error rate (WER) of each transcript against its reference text.
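
For reference, WER is conventionally defined from the minimum word-level edit distance between the reference and the model output. The symbols below are the usual textbook notation, not names used in this notebook: $S$, $D$, and $I$ count substituted, deleted, and inserted words, and $N$ is the number of words in the reference.

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

As a worked case, take the 7-word reference "when i walk its hard to breath" and a hypothesis that differs only in the last word ("breathe"): $S = 1$, $D = I = 0$, so $\mathrm{WER} = 1/7 \approx 0.1429$. Because insertions are counted, WER can exceed 1 for a single utterance.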

**Prerequisite:**

To run this code, you need Kaggle API credentials in a `kaggle.json` file.

**Link to the dataset:**
https://www.kaggle.com/datasets/paultimothymooney/medical-speech-transcription-and-intent

**Link to the Whisper model card:**
https://huggingface.co/openai/whisper-large-v3-turbo

In [None]:
!pip install --upgrade pip
!pip install --upgrade transformers datasets[audio] accelerate


In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

In [None]:
#Install kaggle
!pip install kaggle --quiet

In [None]:
# Upload kaggle.json
from google.colab import files
_ = files.upload()  # assign the result so the file contents (the API key) are not echoed in the cell output

Saving kaggle.json to kaggle.json

In [None]:
# Create the kaggle directory (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle

In [None]:
#copy kaggle.json file
!cp kaggle.json ~/.kaggle/

In [None]:
ls -ltr ~/.kaggle/

total 4
-rw-r--r-- 1 root root 71 Oct 29 04:11 kaggle.json


In [None]:
#setting right permission
!chmod 600 ~/.kaggle/kaggle.json
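
The directory, copy, and permission steps above can also be done in one Python helper. This is a minimal sketch, not part of the original notebook: the function name `install_kaggle_credentials` is hypothetical, and the demo writes a placeholder file into a temporary directory rather than touching real credentials.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_credentials(src: Path, kaggle_dir: Path) -> Path:
    """Copy kaggle.json into kaggle_dir and restrict it to owner read/write."""
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    dest = kaggle_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    dest.chmod(stat.S_IRUSR | stat.S_IWUSR)  # same bits as `chmod 600`
    return dest


# Demo against a throwaway directory with a placeholder file (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "kaggle.json"
    src.write_text('{"username": "example-user", "key": "example-key"}')
    dest = install_kaggle_credentials(src, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(dest.stat().st_mode)
    print(oct(mode))
```

In Colab the call would presumably be `install_kaggle_credentials(Path("kaggle.json"), Path.home() / ".kaggle")`. The permission bits behave as described on Linux and macOS; Windows handles them differently.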

In [None]:
!kaggle datasets list -s 'Medical Speech, Transcription, and Intent'

ref                                                        title                                            size  lastUpdated                 downloadCount  voteCount  usabilityRating  
---------------------------------------------------------  -----------------------------------------  ----------  --------------------------  -------------  ---------  ---------------  
paultimothymooney/medical-speech-transcription-and-intent  Medical Speech, Transcription, and Intent  5654300742  2019-02-24 04:37:00.727000           6970        135  0.7058824        
sj4canada/medical-speech-transcription-and-intent          Medical Speech, Transcription, and Intent  2826870688  2025-03-06 20:28:31.067000              4          0  0.23529412       
sj4canada/medical-speech-transcription-and-intent-2        Medical Speech Transcription and Intent 2   531276294  2025-03-06 21:45:05.947000             11          0  0.23529412       
randhumonous/testing-data                                  testing dat

In [None]:
!kaggle datasets download -d "paultimothymooney/medical-speech-transcription-and-intent" -p ./data/

Dataset URL: https://www.kaggle.com/datasets/paultimothymooney/medical-speech-transcription-and-intent
License(s): other
Downloading medical-speech-transcription-and-intent.zip to ./data
 99% 5.20G/5.27G [01:37<00:03, 17.4MB/s]
100% 5.27G/5.27G [01:37<00:00, 58.0MB/s]


In [None]:
# Unzip the dataset
!unzip -q ./data/medical-speech-transcription-and-intent.zip -d ./data/

print("✓ Unzipped!")

✓ Unzipped!


In [None]:
import os

# Check the full directory structure
print("=== Full Directory Structure ===\n")

for root, dirs, files in os.walk('./data/Medical Speech, Transcription, and Intent'):
    print(f"{root}/")
    # Show at most two files per directory
    for file in files[:2]:
        print(f"  {file}")

    if len(files) > 2:
        print(f"  ... and {len(files) - 2} more files")

    print()

=== Full Directory Structure ===

./data/Medical Speech, Transcription, and Intent/
  overview-of-recordings.csv

./data/Medical Speech, Transcription, and Intent/recordings/

./data/Medical Speech, Transcription, and Intent/recordings/train/
  1249120_44220382_64354562.wav
  1249120_44246595_26959468.wav
  ... and 379 more files

./data/Medical Speech, Transcription, and Intent/recordings/validate/
  1249120_44323331_82659146.wav
  1249120_44263136_107636900.wav
  ... and 383 more files

./data/Medical Speech, Transcription, and Intent/recordings/test/
  1249120_43788827_107201794.wav
  1249120_43855932_50139479.wav
  ... and 5893 more files



In [None]:
import pandas as pd

csv_path = './data/Medical Speech, Transcription, and Intent/overview-of-recordings.csv'
df = pd.read_csv(csv_path)

print(f"Columns ({len(df.columns)}): {list(df.columns)}")
print(f"Total rows: {len(df)}")
print("\nFirst 5 rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)

Columns (13): ['audio_clipping', 'audio_clipping:confidence', 'background_noise_audible', 'background_noise_audible:confidence', 'overall_quality_of_the_audio', 'quiet_speaker', 'quiet_speaker:confidence', 'speaker_id', 'file_download', 'file_name', 'phrase', 'prompt', 'writer_id']
Total rows: 6661

First 5 rows:
   audio_clipping  audio_clipping:confidence background_noise_audible  \
0     no_clipping                     1.0000              light_noise   
1  light_clipping                     0.6803                 no_noise   
2     no_clipping                     1.0000                 no_noise   
3     no_clipping                     1.0000              light_noise   
4     no_clipping                     1.0000                 no_noise   

   background_noise_audible:confidence  overall_quality_of_the_audio  \
0                               1.0000                          3.33   
1                               0.6803                          3.33   
2                             

In [None]:
# Step 1: Load original CSV with transcripts
csv_path = './data/Medical Speech, Transcription, and Intent/overview-of-recordings.csv'
df_original = pd.read_csv(csv_path)

print(f"Loaded CSV: {len(df_original)} rows")
print(f"Columns: {list(df_original.columns)}\n")

Loaded CSV: 6661 rows
Columns: ['audio_clipping', 'audio_clipping:confidence', 'background_noise_audible', 'background_noise_audible:confidence', 'overall_quality_of_the_audio', 'quiet_speaker', 'quiet_speaker:confidence', 'speaker_id', 'file_download', 'file_name', 'phrase', 'prompt', 'writer_id']



In [None]:
# Step 2: Get all audio files from VALIDATE directory
validate_dir = './data/Medical Speech, Transcription, and Intent/recordings/validate'

audio_files = []
for root, dirs, files in os.walk(validate_dir):
    for file in files:
        if file.endswith('.wav'):
            audio_files.append({
                'file_name': file,
                'filepath': os.path.join(root, file)
            })

print(f"Found {len(audio_files)} audio files in validate directory\n")

Found 385 audio files in validate directory



In [None]:
# Step 3: Create DataFrame with audio files
df_audio = pd.DataFrame(audio_files)

In [None]:
# Step 4: Match with original transcripts using file_name
df_merged = df_audio.merge(
    df_original[['file_name', 'phrase', 'prompt', 'speaker_id']],
    on='file_name',
    how='left'
)

print(f"Matched {len(df_merged)} files with original transcripts")
print(f"Files without transcripts: {df_merged['phrase'].isna().sum()}\n")
print("First 5 matched files:")
print(df_merged[['file_name', 'phrase', 'prompt']].head())

Matched 385 files with original transcripts
Files without transcripts: 0

First 5 matched files:
                        file_name  \
0   1249120_44323331_82659146.wav   
1  1249120_44263136_107636900.wav   
2   1249120_44323331_10095808.wav   
3   1249120_44263136_31123906.wav   
4   1249120_44292353_25851801.wav   

                                              phrase              prompt  
0                   When I walk it's hard to breath.      Hard to breath  
1  I may have overdone it with the weightlifting,...  Injury from sports  
2                  There is a sharp pain in my bicep         Muscle pain  
3  I get big patches of irritated pimples on my b...                Acne  
4  When I stand up too quickly I start to feel di...       Feeling dizzy  
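
The merge above reports zero files without transcripts. If that check needs to be stricter, `DataFrame.merge` has two options that make mismatches explicit. The sketch below uses toy frames with made-up file names in place of `df_audio` and `df_original`. It also assumes each `file_name` appears once in the metadata, which this notebook does not verify for the full CSV.

```python
import pandas as pd

# Toy stand-ins for df_audio and df_original; the file names are made up.
audio = pd.DataFrame({"file_name": ["a.wav", "b.wav", "c.wav"]})
meta = pd.DataFrame({"file_name": ["a.wav", "b.wav"], "phrase": ["hello", "hi there"]})

checked = audio.merge(
    meta,
    on="file_name",
    how="left",
    validate="many_to_one",  # raise if a file_name repeats in the metadata
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Audio files that found no transcript row
unmatched = checked.loc[checked["_merge"] == "left_only", "file_name"].tolist()
print(unmatched)
```

With `validate="many_to_one"`, a duplicated key in the metadata raises an error instead of silently adding rows to the result.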


In [None]:
# Step 5: Setup Whisper model
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

print(f"\nUsing device: {device}")

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Step 6: Transcribe all validate files
print("\n=== Starting Transcription ===")
transcripts = []

for idx, row in df_merged.iterrows():
    filename = row['file_name']
    filepath = row['filepath']

    print(f"{idx+1}/{len(df_merged)}: {filename}")

    try:
        result = pipe(filepath, return_timestamps=True)
        transcript = result['text']
    except Exception as e:
        # Keep exactly one entry per row so the list lines up with df_merged
        print(f"Error: {e}")
        transcripts.append("")
    else:
        transcripts.append(transcript)
        print(f"Whisper: {transcript[:60]}...")
        print(f"Original: {str(row['phrase'])[:60]}...")

# Step 7: Add Whisper transcripts to dataframe
df_merged['wlv3t_transcript'] = transcripts

# Rename phrase column for clarity
df_merged.rename(columns={'phrase': 'ori_script'}, inplace=True)

# Step 8: Save results
output_path = 'validate_transcripts_with_wer.csv'
df_merged.to_csv(output_path, index=False)

print(f"\n✓ Saved to: {output_path}")
print(f"Columns: {list(df_merged.columns)}")
print("\nFirst 3 results:")
print(df_merged[['file_name', 'ori_script', 'wlv3t_transcript']].head(3))



Using device: cuda:0


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!
Device set to use cuda:0



=== Starting Transcription ===
1/385: 1249120_44323331_82659146.wav


Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.


Whisper:  when I walk it's hard to breathe...
Original: When I walk it's hard to breath....
2/385: 1249120_44263136_107636900.wav
Whisper:  I may have overdone it with the weight lifting because I am...
Original: I may have overdone it with the weightlifting, because I am ...
3/385: 1249120_44323331_10095808.wav
Whisper:  there is a sharp pain in my B cap...
Original: There is a sharp pain in my bicep...
4/385: 1249120_44263136_31123906.wav
Whisper:  I get bit patches of irritated pimples on my back and they ...
Original: I get big patches of irritated pimples on my back and they h...
5/385: 1249120_44292353_25851801.wav
Whisper:  When I stand up too quickly, I start to feel dizzy and ligh...
Original: When I stand up too quickly I start to feel dizzy and light-...
6/385: 1249120_44273314_96091387.wav
Whisper:  I feel that there is great pain in my left shoulder....
Original: i feel that is great pain in my left shoulder...
7/385: 1249120_44294866_20041027.wav
Whisper:  I am disappoint

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Whisper:  I got a divorce last year, and I just can't stop dwelling o...
Original: I got a divorce last year and I just can't stop dwelling on ...
11/385: 1249120_44292353_67132028.wav
Whisper:  I feel weak all over....
Original: I feel weak all over....
12/385: 1249120_44323331_81893062.wav
Whisper:  I feel pain in my ears with tinnitus...
Original: I feel pain in my ears with tinnitus...
13/385: 1249120_44294866_105332708.wav
Whisper:  I don't have full range of motion with my arms...
Original: I don't have full range of motion with my arms...
14/385: 1249120_44259428_108477882.wav
Whisper:  When it hurts about the world, that it tastes eating more t...
Original: My knees hurt so bad to walk that I stay sitting more than I...
15/385: 1249120_44323331_16590901.wav
Whisper:  I feel Sarvar itching in the skin with redness....
Original: I feel severe itching in the skin with redness...
16/385: 1249120_44294866_104890103.wav
Whisper:  I have a hard muscle pain since I went to the gym....


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


Whisper:  I was the way I was the first day I was the first day I was...
Original: i was injured during football match, i was diagnosed with Cr...
251/385: 1249120_44273314_38073414.wav
Whisper:  I feel like I fell in hot water....
Original: I feel like l fell in hot water...
252/385: 1249120_44273314_31545094.wav
Whisper:  The joints in my fingers are painful in the morning....
Original: The joints in my fingers are painful in the morning....
253/385: 1249120_44273314_70741506.wav
Whisper:  I notice a lot more hair coming off than usual when I brush...
Original: I notice a lot more hair coming out than usual when I brush ...
254/385: 1249120_44263136_52222173.wav
Whisper:  Every time I make an effort, I felt dizzy....
Original: Every time I make an effort, I felt dizzy....
255/385: 1249120_44263136_10880734.wav
Whisper:  I had internal pain and gasses when I ate Indian spicy food...
Original: I had internal pain and gases when I ate indian spicy food y...
256/385: 1249120_44259428_882
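
The log above includes a hint that calling the pipeline one file at a time on a GPU is inefficient. One possible way to act on it is to pass file paths in groups. The helper below is a hypothetical sketch (the name `batched` and the demo paths are made up). The commented usage is not run here and assumes the pipeline returns one result dict per input when given a list.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


paths = [f"clip_{i}.wav" for i in range(5)]  # placeholder file names
batches = list(batched(paths, 2))
print(batches)

# Hypothetical use with the pipeline defined earlier (not run here):
# for batch in batched(df_merged["filepath"].tolist(), 16):
#     results = pipe(batch, return_timestamps=True)
#     transcripts.extend(r["text"] for r in results)
```

The trade-off is error isolation: if one file in a group fails, the whole group fails, so a per-file fallback may still be needed.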

In [None]:
!pip install jiwer
import jiwer
import pandas as pd
from pathlib import Path

# Path to your CSV
csv_path = Path.cwd() / "validate_transcripts_with_wer.csv"

# Load CSV
df = pd.read_csv(csv_path)

print(f"Loaded {len(df)} rows")
print(f"Columns: {list(df.columns)}")


transformation = jiwer.Compose([
    jiwer.RemovePunctuation(),
    jiwer.ToLowerCase(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip()
])

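# Empty transcripts are read back from the CSV as NaN; map them to "" instead of the string "nan"
refs = [transformation(str(r)) for r in df["ori_script"]]
hyps = [transformation(h) for h in df["wlv3t_transcript"].fillna("").astype(str)]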


wer_scores = []
for i in range(len(refs)):
    wer_score = jiwer.wer(refs[i], hyps[i])
    wer_scores.append(wer_score)

#Add WER scores to dataframe
df["wer"] = wer_scores

#Calculate average WER
average_wer = sum(wer_scores) / len(wer_scores)

print("\n=== RESULTS ===")
print(f"Model: openai/whisper-large-v3-turbo")

print(f"\nFirst 5 WER scores:")
for idx, row in df.head(5).iterrows():
    print(f"  {row['file_name']}: {row['wer']:.4f}")

print(f"\nAverage WER: {average_wer:.4f}")

#Save updated CSV with WER column
output_path = "validate_transcripts_with_wer_final.csv"
df.to_csv(output_path, index=False)

print(f"\n✓ Saved CSV with WER scores to: {output_path}")
print(f"Columns: {list(df.columns)}")

# Display summary table
print("\n=== SUMMARY TABLE (first 10 rows) ===")
print(df[["file_name", "wer"]].head(10))

Loaded 385 rows
Columns: ['file_name', 'filepath', 'ori_script', 'prompt', 'speaker_id', 'wlv3t_transcript']

=== RESULTS ===
Model: openai/whisper-large-v3-turbo

First 5 WER scores:
  1249120_44323331_82659146.wav: 0.1429
  1249120_44263136_107636900.wav: 0.0952
  1249120_44323331_10095808.wav: 0.2500
  1249120_44263136_31123906.wav: 0.1538
  1249120_44292353_25851801.wav: 0.0000

Average WER: 0.23
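
The average above is the mean of per-utterance WER values, so a short utterance counts as much as a long one. A common alternative pools all errors over all reference words. The sketch below shows both in plain Python. The helper names are hypothetical, `normalize` only approximates the jiwer transformation chain, and the two example pairs are taken from the transcription log earlier in this notebook.

```python
import unicodedata
from typing import List, Sequence, Tuple


def normalize(text: str) -> str:
    """Approximate the jiwer chain: drop punctuation, lowercase, squeeze spaces."""
    no_punct = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(no_punct.lower().split())


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of a reference word
                curr[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),   # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def macro_and_micro_wer(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (mean of per-utterance WER, pooled WER). References must be non-empty."""
    per_utterance = []
    total_errors = 0
    total_words = 0
    for ref_text, hyp_text in pairs:
        ref, hyp = normalize(ref_text).split(), normalize(hyp_text).split()
        errors = word_errors(ref, hyp)
        per_utterance.append(errors / len(ref))
        total_errors += errors
        total_words += len(ref)
    return sum(per_utterance) / len(per_utterance), total_errors / total_words


pairs = [
    ("When I walk it's hard to breath.", " when I walk it's hard to breathe"),
    ("There is a sharp pain in my bicep", " there is a sharp pain in my B cap"),
]
macro, micro = macro_and_micro_wer(pairs)
print(f"macro={macro:.4f} micro={micro:.4f}")
```

Here the two utterances have 1 error in 7 words and 2 errors in 8 words, so the pooled value is 3/15 while the mean of the two rates is slightly lower. To my knowledge `jiwer.wer` also accepts lists of references and hypotheses and then pools errors in the same way, which would be the simpler route inside this notebook.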

In [None]:
import pandas as pd
import jiwer
from pathlib import Path
#Load the CSV file
csv_path = Path.cwd() / "validate_transcripts_with_wer.csv"
df = pd.read_csv(csv_path)

#Define normalization transformation
transformation = jiwer.Compose([
    jiwer.RemovePunctuation(),
    jiwer.ToLowerCase(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip()
])

#Create alignment visualizations
alignment_visualizations = []

for idx, row in df.iterrows():
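    ref = transformation(str(row["ori_script"]))
    # Empty transcripts come back from the CSV as NaN; treat them as "" rather than "nan"
    hyp_raw = row["wlv3t_transcript"]
    hyp = transformation("" if pd.isna(hyp_raw) else str(hyp_raw))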

    #Generate word-level alignment
    out = jiwer.process_words([ref], [hyp])
    alignment_str = jiwer.visualize_alignment(out)

    alignment_visualizations.append(alignment_str)

#Add alignment visualization column to dataframe
df["alignment_visualization"] = alignment_visualizations

# Save updated CSV
output_path = Path.cwd() / "wer_whisperlargev3_error_map.csv"
df.to_csv(output_path, index=False)

print("✅ Saved updated CSV with alignment visualizations to:", output_path)

✅ Saved updated CSV with alignment visualizations to: /content/wer_whisperlargev3_error_map.csv
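
To get a rough idea of what such an alignment contains without jiwer, the standard library's `difflib.SequenceMatcher` can label the word spans that differ. This is a stand-in technique, not what jiwer does: difflib does not guarantee a minimum edit-distance alignment, and the function name `word_ops` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_ops(ref: str, hyp: str) -> List[Tuple[str, str, str]]:
    """List (tag, reference words, hypothesis words) for each differing span.

    Tags are difflib's: "replace", "delete", or "insert".
    """
    ref_words, hyp_words = ref.split(), hyp.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, " ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2])))
    return ops


# Normalized pair taken from the transcription log earlier in this notebook
ops = word_ops("there is a sharp pain in my bicep", "there is a sharp pain in my b cap")
print(ops)
```

A single `replace` span here covers what a word-level edit distance counts as one substitution plus one insertion, so these spans are useful for reading errors but not for computing WER.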
