## [GitHub Repo](https://github.com/ml-explore/mlx-examples)

## [HF Community](https://huggingface.co/mlx-community)

---

In [4]:
!pip install datasets -q
!pip install soundfile -q

In [None]:
!pip install pydub -q
!pip install mlx-whisper -q
!pip install huggingface_hub -q

In [44]:
!pip install jiwer -q

---

## Loading some data from a German ASR dataset

In [12]:
from datasets import load_dataset

# Load the dataset in streaming mode
dataset = load_dataset('flozi00/german-canary-asr-0324', split='train', streaming=True)

# Initialize an iterator
iterator = iter(dataset)

# Fetch the first 10 rows and store them in a list of dictionaries
first_10_rows = []
for _ in range(10):
    row = next(iterator)
    first_10_rows.append({
        'audio': row['audio'],          # audio column (a dict that holds the raw bytes)
        'text1': row['transkription'],  # transcription column, stored here as 'text1'
        'text2': row['source']          # source column, stored here as 'text2'
    })

""" # Print the list of dictionaries
for entry in first_10_rows:
    print(entry) """


' # Print the list of dictionaries\nfor entry in first_10_rows:\n    print(entry) '
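
The manual `next()` loop above can also be written with `itertools.islice`, which stops cleanly if the stream yields fewer rows than requested. This is a minimal sketch: `fake_stream` and `take_rows` are hypothetical names, and the generator is a stand-in for the real streaming dataset so the example runs without network access. Only the column names are taken from the cell above.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for the streaming dataset (made-up rows, not real data)."""
    for i in range(3):
        yield {"audio": {"bytes": b""}, "transkription": f"text {i}", "source": "demo"}


def take_rows(rows, n):
    """Return up to n rows, keeping only the columns used in this notebook."""
    return [
        {"audio": row["audio"], "text1": row["transkription"], "text2": row["source"]}
        for row in islice(rows, n)
    ]


# Asking for 10 rows from a 3-row stream returns 3 rows instead of raising StopIteration.
rows = take_rows(fake_stream(), 10)
print(len(rows))
```

With the real dataset, `take_rows(dataset, 10)` would presumably replace the explicit iterator and loop.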

In [2]:
from pydub import AudioSegment
import tempfile
from IPython.display import Audio
import mlx_whisper
import os



In [13]:
# Extract the audio byte string
audio_bytes = first_10_rows[1]['audio']['bytes']

# Create a temporary file to save the audio
with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as temp_audio_file:
    temp_audio_filename = temp_audio_file.name
    temp_audio_file.write(audio_bytes)

# Optional: play the audio in the Jupyter notebook
Audio(temp_audio_filename)

# The temporary audio file is also the file that the ASR model will transcribe.
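
Because the file is created with `delete=False`, it stays on disk until it is removed explicitly. The sketch below shows one way to write the bytes and clean up afterwards. `write_temp_audio` is a hypothetical helper, and the three placeholder bytes are not real audio.

```python
import os
import tempfile


def write_temp_audio(audio_bytes, suffix=".mp3"):
    """Write raw audio bytes to a named temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as f:
        f.write(audio_bytes)
        return f.name  # the file is closed when the with-block exits


path = write_temp_audio(b"\x00\x01\x02")  # placeholder bytes, not a real recording
try:
    # The transcription call would go here; this sketch only checks the file size.
    size = os.path.getsize(path)
    print(f"{path}: {size} bytes")
finally:
    os.remove(path)  # delete=False means cleanup is our responsibility
```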


---

## Function to increase the volume (Whisper doesn't need it, yet?)

In [26]:
import numpy as np
from scipy.io import wavfile

def increase_volume(input_file, output_file, volume_factor):
    # Read the wav file
    sample_rate, data = wavfile.read(input_file)
    
    # Convert to float for processing
    data_float = data.astype(float)
    
    # Increase the volume by multiplying
    data_increased = data_float * volume_factor
    
    # Clip values to avoid distortion
    data_increased = np.clip(data_increased, -32768, 32767)
    
    # Convert back to int16
    data_increased = data_increased.astype(np.int16)
    
    # Save the modified file
    wavfile.write(output_file, sample_rate, data_increased)

# Example usage
input_wav = "sample1.wav"
output_wav = f"{input_wav[:-4]}_increased.wav"
volume_factor = 10.0  # Multiplies the amplitude by 10; loud samples will clip. Try 1.0-4.0 for gentler levels

increase_volume(input_wav, output_wav, volume_factor)
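
Since the function clips to the int16 range, a large factor such as 10.0 can flatten loud passages. As a rough aid for picking a factor, the sketch below estimates the largest factor that avoids clipping and the share of samples that a given factor would clip. It uses plain Python lists and made-up sample values; `max_safe_factor` and `clipped_fraction` are hypothetical helpers, not part of the notebook.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def max_safe_factor(samples):
    """Largest factor that keeps every scaled sample within the int16 range."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return 1.0  # pure silence: any factor is safe, so return a neutral value
    return INT16_MAX / peak


def clipped_fraction(samples, factor):
    """Fraction of samples that would be clipped after scaling by factor."""
    clipped = sum(1 for s in samples if not INT16_MIN <= s * factor <= INT16_MAX)
    return clipped / len(samples)


samples = [1000, -2000, 3000, -4000]  # made-up int16 values for illustration
print(f"Max safe factor: {max_safe_factor(samples):.2f}")
print(f"Clipped at 10x: {clipped_fraction(samples, 10.0):.0%}")
```

With real audio, `samples` would be the array returned by `wavfile.read`, flattened if the file is stereo.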

---

## Transcribing audio

[Whisper Collection](https://huggingface.co/collections/mlx-community/whisper-663256f9964fbb1177db93dc)

## Sample 1

**Already downloaded** | **Average Transcription Time (30s)** | **Word Error Rate**
--- | --- | ---
mlx-community/whisper-tiny-mlx | 0.5s | 0.66
mlx-community/whisper-base-mlx | 0.91s | 0.47
mlx-community/whisper-small-mlx | 1.8s | 0.37
no *medium* model found on HF | |
mlx-community/whisper-large-v3-mlx | 6.8s | 0.16
mlx-community/whisper-large-v3-turbo | 2.8s | 0.16

## Sample 2

**Already downloaded** | **Average Transcription Time (30s)** | **Word Error Rate**
--- | --- | ---
mlx-community/whisper-tiny-mlx | 0.4s | 0.38
mlx-community/whisper-base-mlx | 0.8s | 0.21
mlx-community/whisper-small-mlx | 1.8s | 0.28
no *medium* model found on HF | |
mlx-community/whisper-large-v3-mlx | 9.8s | 0.11
mlx-community/whisper-large-v3-turbo | 5.3s | 0.11

In [64]:
import mlx_whisper
import time
from jiwer import wer

audio_name = "sample1"
audio_file = f"audio_samples/{audio_name}.wav"
path_transcript = f"audio_samples/{audio_name}_transcript.txt"

with open(path_transcript, "r") as file:
    reference_transcript = file.read()

number_of_transcriptions = 10
transcription_times = []
word_errors = []

for n in range(number_of_transcriptions):
    # Set timer
    start = time.time()
    # Transcribe the audio
    result = mlx_whisper.transcribe(
        audio_file,
        path_or_hf_repo="mlx-community/whisper-tiny-mlx",
        language = "de",
        #verbose = True,
        #word_timestamps=True,
    )
    print(f"Transcription time {n+1}: {time.time()-start:.2f} seconds")
    # Store the time taken
    transcription_times.append(time.time()-start)

    print(result['text'])
    error = wer(reference_transcript , result['text'])
    print(f"Word Error Rate {n+1}: {error}")
    word_errors.append(error)

    
# Calculate the average time taken
average_time = sum(transcription_times) / number_of_transcriptions
print(f"Average transcription time: {average_time:.2f} seconds")

# Calculate average time without the first
average_time = sum(transcription_times[1:]) / (number_of_transcriptions-1)
print(f"Average transcription time (excluding the first): {average_time:.2f} seconds")

# Calculate the average WER
average_wer = sum(word_errors) / number_of_transcriptions
print(f"Average Word Error Rate: {average_wer:.2f}")

Fetching 4 files: 100%|██████████| 4/4 [00:00<00:00, 70492.50it/s]


Transcription time 1: 0.90 seconds
 und ich habe die Schulden freimer an. Und ich habe überlegt, ob ich mich ein sprechen, das Schuldenanteils ein Kaufeln das Unternehmen. Langfristig finde ich echt stark, wenn es des Helfhorses wäre, am besten aus Systeme mit Zwollar und zwischen Speichern. Das war richtig krass.
Word Error Rate 1: 0.6578947368421053
Transcription time 2: 0.42 seconds
 und ich habe die Schulden freimer an. Und ich habe überlegt, ob ich mich ein sprechen, das Schuldenanteils ein Kaufeln das Unternehmen. Langfristig finde ich echt stark, wenn es des Helfhorses wäre, am besten aus Systeme mit Zwollar und zwischen Speichern. Das war richtig krass.
Word Error Rate 2: 0.6578947368421053
Transcription time 3: 0.40 seconds
 und ich habe die Schulden freimer an. Und ich habe überlegt, ob ich mich ein sprechen, das Schuldenanteils ein Kaufeln das Unternehmen. Langfristig finde ich echt stark, wenn es des Helfhorses wäre, am besten aus Systeme mit Zwollar und zwischen Speichern.
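
The first run above is noticeably slower than the following ones (0.90 s versus roughly 0.4 s), which is why the cell also reports an average that excludes it. One way to make that explicit is a small timing helper with untimed warm-up calls. This is a sketch only: `benchmark` is a hypothetical helper, and the lambda is a stand-in workload so the example runs without the model.

```python
import statistics
import time


def benchmark(fn, runs=10, warmup=1):
    """Call fn() `warmup` times untimed, then `runs` times timed; return the timings in seconds."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload; in the notebook this would wrap the mlx_whisper.transcribe call.
timings = benchmark(lambda: sum(range(100_000)), runs=5)
print(f"mean {statistics.mean(timings):.4f}s, stdev {statistics.stdev(timings):.4f}s")
```

Reporting the standard deviation next to the mean helps to judge whether differences between models exceed run-to-run noise.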

In [37]:
print(result.keys())

dict_keys(['text', 'segments', 'language'])
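
Besides `text`, the result contains `segments`. The sketch below prints each segment with its time range. It assumes that every segment is a dict with `start`, `end`, and `text` keys, as in the original OpenAI Whisper output format; that layout is an assumption and is not confirmed by the output above. `mock_result` is made-up data and `format_segments` is a hypothetical helper.

```python
def format_segments(result):
    """Return one 'start -> end: text' line per segment of a transcription result."""
    lines = []
    for seg in result.get("segments", []):
        # Assumed keys: 'start' and 'end' in seconds, 'text' as a string
        lines.append(f"[{seg['start']:6.2f}s -> {seg['end']:6.2f}s] {seg['text'].strip()}")
    return lines


# Made-up result with the same top-level keys as the real one
mock_result = {
    "text": " Hallo Welt. Wie geht's?",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hallo Welt."},
        {"start": 1.5, "end": 2.75, "text": " Wie geht's?"},
    ],
    "language": "de",
}

lines = format_segments(mock_result)
for line in lines:
    print(line)
```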


---

## Word Error Rate (Plain)

In [54]:
def levenshtein_distance(s1, s2):
    """Calculate the Levenshtein distance between two strings of words"""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    
    return previous_row[-1]

# Original text words
with open("audio_samples/sample2_transcript.txt", "r") as file:
    original = file.read().split()

whisper_transcript = result['text'].split()

# Calculate Levenshtein distance
distance = levenshtein_distance(original, whisper_transcript)

# Calculate WER (named wer_plain so it does not shadow the wer function imported from jiwer)
wer_plain = distance / len(original)

# Find differences (position-by-position comparison, so it is only meaningful
# when both texts have the same number of words)
differences = []
for orig_word, trans_word in zip(original, whisper_transcript):
    if orig_word != trans_word:
        differences.append(f"Original: '{orig_word}' -> Transcript: '{trans_word}'")

print(f"Number of words in original: {len(original)}")
print(f"Number of words in transcript: {len(whisper_transcript)}")
print(f"Levenshtein distance: {distance}")
print(f"Word Error Rate (WER): {wer_plain:.4f}")
print("\nKey differences:")
for diff in differences:
    print(diff)

Number of words in original: 47
Number of words in transcript: 47
Levenshtein distance: 5
Word Error Rate (WER): 0.1064

Key differences:
Original: 'ähnlichen' -> Transcript: 'ähnlich'
Original: 'App' -> Transcript: 'App,'
Original: 'wohlfühle.' -> Transcript: 'wohlfühle'
Original: 'Und' -> Transcript: 'und'
Original: 'ich' -> Transcript: 'ich...'
