Merged
101 changes: 91 additions & 10 deletions metal/whisper/README.md
@@ -21,6 +21,7 @@ The Metal backend scripts are located under `metal/`
## Whisper Example

The Whisper example demonstrates how to:

1. Set up your environment
2. Export the Whisper model to ExecuTorch format
3. Build the Whisper Metal runner
@@ -37,7 +38,8 @@ metal/whisper/e2e.sh \
--setup-env \
--export \
--build \
--run --audio-path <audio_path> \
[--model-name <model_name>]
```

**Required Arguments:**
@@ -46,7 +48,13 @@ metal/whisper/e2e.sh \
- `<env_name>` - Name of the conda environment to create (e.g., `whisper-example`)
- `<audio_path>` - Path to your audio file for inference (e.g., `~/Desktop/audio.wav`)

**Optional Arguments:**

- `<model_name>` - HuggingFace Whisper model name (default: `openai/whisper-large-v3-turbo`)
- Available models: `openai/whisper-tiny`, `openai/whisper-base`, `openai/whisper-small`,
`openai/whisper-medium`, `openai/whisper-large`, `openai/whisper-large-v3-turbo`

**Example (default large-v3-turbo model):**

```bash
metal/whisper/e2e.sh \
--artifact-dir ~/Desktop/whisper \
--env-name whisper-example \
--clone-et \
--create-env \
--setup-env \
--export \
--build \
--run --audio-path ~/Desktop/audio.wav
```

**Example (using small model):**

```bash
metal/whisper/e2e.sh \
--artifact-dir ~/Desktop/whisper \
--env-name whisper-example \
--clone-et \
--create-env \
--setup-env \
--export \
--model-name openai/whisper-small \
--build \
--run --audio-path ~/Desktop/audio.wav
```

This will automatically:

1. Set up the environment:

   - Clone the executorch repo
   - Create a conda environment named `whisper-example`
   - Install all dependencies

2. Export the Whisper model to the `~/Desktop/whisper` directory
3. Build the Whisper Metal runner
4. Run inference on `~/Desktop/whisper/audio.wav`
@@ -86,14 +112,29 @@ conda activate <env_name>
#### Step 2: Export the Model

```bash
/path/to/metal/whisper/export.sh <artifact_dir> [model_name]
```

**Arguments:**

- `<artifact_dir>` - Directory to store exported model files (required, e.g., `~/Desktop/whisper`)
- `[model_name]` - HuggingFace Whisper model name (optional, default: `openai/whisper-large-v3-turbo`)
- Available models: `openai/whisper-tiny`, `openai/whisper-base`, `openai/whisper-small`,
`openai/whisper-medium`, `openai/whisper-large`, `openai/whisper-large-v3-turbo`

**Examples:**

```bash
# Export default large-v3-turbo model
/path/to/metal/whisper/export.sh ~/Desktop/whisper

# Export small model
/path/to/metal/whisper/export.sh ~/Desktop/whisper openai/whisper-small
```

This will:

- Download the specified Whisper model from HuggingFace
- Export it to ExecuTorch format with Metal optimizations
- Save model files (`.pte`), metadata, and preprocessor to the specified directory

@@ -110,11 +151,51 @@ This will:
```

**Arguments:**

- `<audio_path>` - Path to your audio file (required, e.g., `/path/to/audio.wav`)
- `<artifact_dir>` - Directory containing exported model files (required, e.g., `~/Desktop/whisper`)

**Example:**

```bash
/path/to/metal/whisper/run.sh ~/Desktop/audio.wav ~/Desktop/whisper
```

This will:

- Validate that all required model files exist
- Load the model and preprocessor
- Run inference on the provided audio
- Display timing information
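
The validation step can be sketched roughly as follows. This is a minimal illustration of the kind of check `run.sh` performs, not the script itself; the temporary directory and the exact file names (`model.pte` in particular) are hypothetical stand-ins:

```shell
# Illustrative sketch: verify that the expected artifacts exist before running.
# File names here are assumptions for demonstration, not run.sh's exact list.
ARTIFACT_DIR="$(mktemp -d)"
touch "$ARTIFACT_DIR/model.pte" "$ARTIFACT_DIR/whisper_preprocessor.pte" "$ARTIFACT_DIR/tokenizer.json"

for f in model.pte whisper_preprocessor.pte tokenizer.json; do
  if [ ! -f "$ARTIFACT_DIR/$f" ]; then
    echo "Error: missing $f in $ARTIFACT_DIR" >&2
    exit 1
  fi
done
echo "all required files present"
```

Failing fast on a missing file gives a clearer error than a tensor-shape or load failure deep inside the runner.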

## Available Whisper Models

The following Whisper models are supported:

| Model Name | HuggingFace ID (for export) | Parameters | Mel Features | Relative Speed | Use Case |
| -------------- | ------------------------------- | ---------- | ------------ | -------------- | ------------------------------ |
| Tiny | `openai/whisper-tiny` | 39M | 80 | Fastest | Quick transcription, real-time |
| Base | `openai/whisper-base` | 74M | 80 | Very Fast | Good balance for real-time |
| Small | `openai/whisper-small` | 244M | 80 | Fast | Recommended for most use cases |
| Medium | `openai/whisper-medium` | 769M | 80 | Moderate | Higher accuracy needed |
| Large | `openai/whisper-large` | 1550M | 80 | Slower | Best accuracy |
| Large V3 | `openai/whisper-large-v3` | 1550M | **128** | Slower | Latest architecture |
| Large V3 Turbo | `openai/whisper-large-v3-turbo` | 809M | **128** | Fast | Default, good balance |

### Mel Features Configuration

The export scripts automatically configure the correct mel feature size based on the model:

- **80 mel features**: Used by all standard models (tiny, base, small, medium, large, large-v2)
- **128 mel features**: Used only by large-v3 and large-v3-turbo variants

**Important:** The preprocessor must match the model's expected feature size, or you'll encounter tensor shape mismatch errors. The export scripts handle this automatically.
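
The model-name check can be sketched as a small helper; this is a paraphrase of the substring test in `export.sh` (which uses an `if [[ … == *"large-v3"* ]]` pattern), written as a `case` for illustration:

```shell
# Map a HuggingFace Whisper model name to its mel feature size.
# Mirrors the selection logic described above: only large-v3 variants use 128.
pick_feature_size() {
  case "$1" in
    *large-v3*) echo 128 ;;
    *)          echo 80  ;;
  esac
}

pick_feature_size "openai/whisper-small"          # -> 80
pick_feature_size "openai/whisper-large-v3-turbo" # -> 128
```

Because the match is on the substring `large-v3`, both `openai/whisper-large-v3` and `openai/whisper-large-v3-turbo` get 128 features without listing each variant explicitly.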

### Tokenizer Configuration

**Important Note:** All Whisper models downloaded from HuggingFace now use the updated tokenizer format where:

- Token `50257` = `<|endoftext|>`
- Token `50258` = `<|startoftranscript|>` (used as `decoder_start_token_id`)

The whisper_runner automatically uses `decoder_start_token_id=50258` for all models, so you don't need to worry about tokenizer compatibility when exporting and running any Whisper variant.
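
As an illustration, the token IDs above can be read back from a tokenizer config once downloaded; the inline JSON below is a hypothetical excerpt of such a file, not the full config that `export.sh` fetches:

```python
import json

# Hypothetical excerpt of a Whisper tokenizer/generation config; the real
# files are downloaded by export.sh from the model's HuggingFace repo.
config_text = '{"eos_token_id": 50257, "decoder_start_token_id": 50258}'
config = json.loads(config_text)

# 50257 = <|endoftext|>, 50258 = <|startoftranscript|>
assert config["eos_token_id"] == 50257
assert config["decoder_start_token_id"] == 50258
print(config["decoder_start_token_id"])
```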
10 changes: 9 additions & 1 deletion metal/whisper/e2e.sh
@@ -18,6 +18,7 @@ EXECUTORCH_PATH=""
ARTIFACT_DIR=""
ENV_NAME=""
AUDIO_PATH=""
MODEL_NAME="openai/whisper-large-v3-turbo"

echo "Current script path: $(realpath "$0")"
SCRIPT_DIR="$(realpath "$(dirname "$(realpath "$0")")")"
@@ -37,13 +38,15 @@ usage() {
echo " --create-env Create the Python environment"
echo " --setup-env Set up the Python environment"
echo " --export Export the Whisper model"
echo " --model-name NAME HuggingFace model name (optional, default: openai/whisper-large-v3-turbo)"
echo " --build Build the Whisper runner"
echo " --audio-path PATH Path to the input audio file"
echo " --run Run the Whisper model"
echo " -h, --help Display this help message"
echo ""
echo "Example:"
echo " $0 --env-name metal-backend --setup-env --export --build --audio-path audio.wav --run"
echo " $0 --env-name metal-backend --export --model-name openai/whisper-small --build --audio-path audio.wav --run"
exit 1
}

@@ -78,6 +81,10 @@ while [[ $# -gt 0 ]]; do
EXPORT=true
shift
;;
--model-name)
MODEL_NAME="$2"
shift 2
;;
--build)
BUILD=true
shift
@@ -160,8 +167,9 @@ fi
# Execute export
if [ "$EXPORT" = true ]; then
echo "Exporting Whisper model to $ARTIFACT_DIR ..."
echo " - Model: $MODEL_NAME"
echo " - Script: $SCRIPT_DIR/export.sh"
conda run -n "$ENV_NAME" "$SCRIPT_DIR/export.sh" "$ARTIFACT_DIR" "$MODEL_NAME"
fi

# Execute build
28 changes: 24 additions & 4 deletions metal/whisper/export.sh
@@ -6,29 +6,49 @@
# LICENSE file in the root directory of this source tree.

ARTIFACT_DIR="$1"
MODEL_NAME="${2:-openai/whisper-large-v3-turbo}"

if [ -z "$ARTIFACT_DIR" ]; then
echo "Error: ARTIFACT_DIR argument not provided."
echo "Usage: $0 <ARTIFACT_DIR> [MODEL_NAME]"
echo ""
echo "Arguments:"
echo " <ARTIFACT_DIR> Directory to store exported model files (required)"
echo " [MODEL_NAME] HuggingFace model name (optional, default: openai/whisper-large-v3-turbo)"
echo ""
echo "Example:"
echo " $0 ~/Desktop/whisper openai/whisper-small"
exit 1
fi

mkdir -p "$ARTIFACT_DIR"

echo "Exporting model: $MODEL_NAME"

# Determine feature_size based on model name
# large-v3 and large-v3-turbo use 128 mel features, all others use 80
if [[ "$MODEL_NAME" == *"large-v3"* ]]; then
FEATURE_SIZE=128
echo "Using feature_size=128 for large-v3/large-v3-turbo model"
else
FEATURE_SIZE=80
echo "Using feature_size=80 for standard Whisper model"
fi

optimum-cli export executorch \
--model "$MODEL_NAME" \
--task "automatic-speech-recognition" \
--recipe "metal" \
--dtype bfloat16 \
--output_dir "$ARTIFACT_DIR"

python -m executorch.extension.audio.mel_spectrogram \
--feature_size "$FEATURE_SIZE" \
--stack_output \
--max_audio_len 300 \
--output_file "$ARTIFACT_DIR"/whisper_preprocessor.pte

TOKENIZER_URL="https://huggingface.co/$MODEL_NAME/resolve/main"

curl -L "$TOKENIZER_URL/tokenizer.json" -o "$ARTIFACT_DIR/tokenizer.json"
curl -L "$TOKENIZER_URL/tokenizer_config.json" -o "$ARTIFACT_DIR/tokenizer_config.json"
1 change: 0 additions & 1 deletion metal/whisper/run.sh
@@ -45,5 +45,4 @@ done
--tokenizer_path "$ARTIFACT_DIR"/ \
--audio_path "$INPUT_AUDIO_PATH" \
--processor_path "$ARTIFACT_DIR"/whisper_preprocessor.pte \
--temperature 0