## Introduction to MLflow and OpenAI's Whisper

Discover the integration of [OpenAI's Whisper](https://huggingface.co/openai), an [ASR system](https://en.wikipedia.org/wiki/Speech_recognition), with MLflow in this tutorial.

### What You Will Learn in This Tutorial

- Establish an audio transcription **pipeline** using the Whisper model.
- **Log** and manage Whisper models with MLflow.
- Infer and understand Whisper model **signatures**.
- **Load** and interact with Whisper models stored in MLflow.
- Utilize MLflow's **pyfunc** for Whisper model serving and transcription tasks.

<details>
    <summary style="cursor: pointer; display: flex; align-items: center;">
        <span style="margin-right: 10px;">&#x25BA;</span>
        <span>Expand to learn more about Whisper and its integration with MLflow.</span>
    </summary>
    <br/>
    <div>
        <h4>What is Whisper?</h4>
        <p>Whisper, developed by OpenAI, is a versatile ASR model trained for high-accuracy speech-to-text conversion. It stands out for its training on diverse accents and environments, and it is available through the Transformers library for easy use.</p>
    </div>
    <div>
        <h4>Why MLflow with Whisper?</h4>
        <p>Integrating MLflow with Whisper enhances ASR model management:</p>
        <ul>
            <li><strong>Experiment Tracking</strong>: Facilitates tracking of model configurations and performance for optimal results.</li>
            <li><strong>Model Management</strong>: Centralizes different versions of Whisper models, enhancing organization and accessibility.</li>
            <li><strong>Reproducibility</strong>: Ensures consistency in transcriptions by tracking all components required for reproducing model behavior.</li>
            <li><strong>Deployment</strong>: Streamlines the deployment of Whisper models in various production settings, ensuring efficient application.</li>
        </ul>
    </div>
</details>
<br/>

Interested in learning more about Whisper? The [white paper](https://arxiv.org/abs/2212.04356) describes the breakthroughs in transcription capability that Whisper brought to the field of ASR, and [OpenAI's research website](https://openai.com/research/whisper) covers its ongoing development.

Ready to enhance your speech-to-text capabilities? Let's explore automatic speech recognition using MLflow and Whisper!


### Setting Up the Environment and Acquiring Audio Data

The first steps for transcription with [Whisper](https://github.com/openai/whisper) are acquiring [audio](https://www.nasa.gov/audio-and-ringtones/) and setting up MLflow.

<details>
    <summary style="cursor: pointer; display: flex; align-items: center;">
        <span style="margin-right: 10px;">&#x25BA;</span>
        <span>Expand for details on environment setup and audio data acquisition.</span>
    </summary>
    <br/>
    <div>
        <p>Before diving into the audio transcription process with OpenAI's Whisper, there are a few preparatory steps to ensure everything is in place for a smooth and effective transcription experience.</p>
    </div>
    <div>
        <h4>Audio Acquisition</h4>
        <p>The first step is to acquire an audio file to work with. For this tutorial, we use a publicly available audio file from NASA. This sample audio provides a practical example to demonstrate Whisper's transcription capabilities.</p>
    </div>
    <div>
        <h4>Model and Pipeline Initialization</h4>
        <p>We load the Whisper model, along with its tokenizer and feature extractor, from the Transformers library. These components are essential for processing the audio data and converting it into a format that the Whisper model can understand and transcribe.</p>
        <p>Next, we create a transcription pipeline using the Whisper model. This pipeline simplifies the process of feeding audio data into the model and obtaining the transcription.</p>
    </div>
    <div>
        <h4>MLflow Environment Setup</h4>
        <p>In addition to the model and audio data setup, we initialize our MLflow environment. MLflow is used to track and manage our experiments, offering an organized way to document the transcription process and results.</p>
    </div>
    <div>
        <p>The following code block covers these initial setup steps, providing the foundation for our audio transcription task with the Whisper model.</p>
    </div>
</details>
<br/>


In [1]:
import requests
import transformers

import mlflow


# Acquire an audio file that is in the public domain
resp = requests.get(
    "https://www.nasa.gov/wp-content/uploads/2015/01/590325main_ringtone_kennedy_WeChoose.mp3"
)
resp.raise_for_status()
audio = resp.content

task = "automatic-speech-recognition"
architecture = "openai/whisper-large-v3"

model = transformers.WhisperForConditionalGeneration.from_pretrained(architecture)
tokenizer = transformers.WhisperTokenizer.from_pretrained(architecture)
feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(architecture)
model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
audio_transcription_pipeline = transformers.pipeline(
    task=task, model=model, tokenizer=tokenizer, feature_extractor=feature_extractor
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
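
Before handing the downloaded bytes to the pipeline, it can be worth confirming that the response really is an MP3 file rather than, for example, an HTML error page. The sketch below is not part of the original notebook: `looks_like_mp3` is a hypothetical helper that only inspects the leading bytes for an ID3 tag or an MPEG frame sync, so treat it as a heuristic rather than a full validation.

```python
def looks_like_mp3(data: bytes) -> bool:
    """Heuristic check: does the payload start like an MP3 file?"""
    # Many MP3 files begin with an ID3v2 metadata tag
    if data[:3] == b"ID3":
        return True
    # Otherwise expect an MPEG audio frame sync: 11 set bits (0xFF, then 0b111xxxxx)
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0


# Hypothetical byte strings, used only to exercise the helper
print(looks_like_mp3(b"ID3\x03\x00"))
print(looks_like_mp3(b"<!DOCTYPE html>"))
```

In the notebook, this could be called as `looks_like_mp3(audio)` right after the download.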


### Formatting the Transcription Output

This section introduces a utility function whose only purpose is to make the transcription output easier to read within this Jupyter notebook demo. It is intended for demonstration only and should not be carried into production code.

The `format_transcription` function takes a long string of transcribed text and formats it by splitting it into sentences and inserting newline characters. This makes the output easier to read when printed in the notebook environment.


In [2]:
def format_transcription(transcription):
    """
    Function for formatting a long string by splitting into sentences and adding newlines.
    """
    # Split naively on ". " (this does not special-case abbreviations or initials),
    # then restore the trailing period that the split removed
    sentences = [
        sentence.strip() + ("." if not sentence.strip().endswith(".") else "")
        for sentence in transcription.split(". ")
        if sentence.strip()
    ]

    # Join the sentences with a newline character
    formatted_text = "\n".join(sentences)

    return formatted_text
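
The function above only splits on periods. If the transcription also contains questions or exclamations, a regular-expression split is one possible alternative. The sketch below is an assumption-laden variant, not part of the original tutorial: `split_sentences` is a hypothetical name, and like the original it does not try to handle abbreviations.

```python
import re


def split_sentences(text: str) -> str:
    """Put each sentence on its own line, splitting after '.', '!' or '?'."""
    # The lookbehind keeps the punctuation attached to the preceding sentence
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(part for part in parts if part)


print(split_sentences("All engines running. Liftoff! We have a liftoff."))
```

Because the punctuation is preserved by the lookbehind, no trailing period needs to be re-added afterwards.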

### Executing the Transcription Pipeline

Perform audio transcription using the Whisper pipeline and review the output.

<details>
    <summary style="cursor: pointer; display: flex; align-items: center;">
        <span style="margin-right: 10px;">&#x25BA;</span>
        <span>Expand to learn about the transcription process and its significance.</span>
    </summary>
    <br/>
    <div>
        <p>After setting up the Whisper model and audio transcription pipeline, our next step is to process an audio file to extract its transcription. This part of the tutorial is crucial as it demonstrates the practical application of the Whisper model in converting spoken language into written text.</p>
    </div>
    <div>
        <h4>Transcription Process</h4>
        <p>The code block below feeds an audio file into the pipeline, which then produces the transcription. The <code>format_transcription</code> function, defined earlier, enhances readability by formatting the output with sentence splits and newline characters.</p>
    </div>
    <div>
        <h4>Importance of Pre-Save Testing</h4>
        <p>Testing the transcription pipeline before saving the model in MLflow is vital. This step verifies that the pipeline runs and produces a sensible transcription, which helps catch problems before deployment. It also provides a baseline to compare against the output after the model is loaded back from MLflow, confirming that logging and loading did not change its behavior.</p>
    </div>
    <div>
        <p>Execute the following code to transcribe the audio and assess the quality and accuracy of the transcription provided by the Whisper model.</p>
    </div>
</details>
<br/>


In [3]:
transcription = audio_transcription_pipeline(audio)

print(format_transcription(transcription["text"]))

We choose to go to the moon in this decade and do the other things.
Not because they are easy, but because they are hard.
3, 2, 1, 0.
All engines running.
Liftoff.
We have a liftoff.
32 minutes past the hour.
Liftoff on Apollo 11.


### Model Signature and Configuration

Generate a model signature for Whisper to understand its input and output data requirements.

<details>
    <summary style="cursor: pointer; display: flex; align-items: center;">
        <span style="margin-right: 10px;">&#x25BA;</span>
        <span>Expand to explore details on model signature and configuration.</span>
    </summary>
    <br/>
    <div>
        <p>The model signature is critical for defining the schema for the Whisper model's inputs and outputs, clarifying the data types and structures expected. This step ensures the model processes inputs correctly and outputs structured data.</p>
    </div>
    <div>
        <h4>Handling Different Audio Formats</h4>
        <p>While the default signature covers binary audio data, the <code>transformers</code> flavor accommodates multiple formats, including numpy arrays and URL-based inputs. This flexibility allows Whisper to transcribe from various sources, although URL-based transcription isn't demonstrated here.</p>
    </div>
    <div>
        <h4>Model Configuration</h4>
        <p>Setting the model configuration involves parameters like <i>chunk</i> and <i>stride</i> lengths for audio processing. These settings are adjustable to suit different transcription needs, enhancing Whisper's performance for specific scenarios.</p>
    </div>
    <div>
        <p>Run the next code block to infer the model's signature and configure key parameters, aligning Whisper's functionality with your project's requirements.</p>
    </div>
</details>
<br/>


In [4]:
model_config = {
    "chunk_length_s": 20,
    "stride_length_s": [5, 3],
}

signature = mlflow.models.infer_signature(
    audio,
    mlflow.transformers.generate_signature_output(audio_transcription_pipeline, audio),
    params=model_config,
)
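
To build some intuition for `chunk_length_s` and `stride_length_s`, the sketch below computes how a long recording could be cut into overlapping windows. It is a simplified illustration written for this tutorial, not the Transformers implementation: `chunk_windows` is a hypothetical helper, it works in whole seconds rather than samples, and it assumes the left and right strides are the overlapping margins discarded from each chunk (except at the very start and end of the audio).

```python
def chunk_windows(duration_s, chunk_length_s=20, stride_length_s=(5, 3)):
    """Return (start, end, keep_start, keep_end) tuples, all in seconds."""
    left, right = stride_length_s
    # Each new chunk advances by the part of the chunk that is not overlap
    step = chunk_length_s - left - right
    if step <= 0:
        raise ValueError("chunk_length_s must exceed the combined stride lengths")

    windows = []
    start = 0
    while True:
        end = min(start + chunk_length_s, duration_s)
        is_last = end >= duration_s
        # The first chunk has no left margin and the last has no right margin
        keep_start = start if start == 0 else start + left
        keep_end = end if is_last else end - right
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return windows


for window in chunk_windows(50):
    print(window)
```

Under these assumptions, the kept regions of consecutive chunks line up end to end, so every second of audio is covered exactly once while each chunk still sees some surrounding context.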

### Setting the Tracking Server and Creating an Experiment

To view the results, we need a tracking server. For the purposes of this tutorial, we use a local tracking server at `http://127.0.0.1:8080`.

Start a local instance of the MLflow server by running the following command from a terminal:

```bash
mlflow server --host 127.0.0.1 --port 8080
```

With the server started, the following code ensures that all experiments, runs, models, parameters, and metrics that we log are tracked within that server instance (which also serves the MLflow UI when you navigate to that URL in a browser).

After setting the tracking URI, we create a new MLflow Experiment to hold the run we're about to create.

In [5]:
mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("WhisperTranscription")

<Experiment: artifact_location='mlflow-artifacts:/224567153691341564', creation_time=1699495264421, experiment_id='224567153691341564', last_update_time=1699495264421, lifecycle_stage='active', name='WhisperTranscription', tags={}>
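
If calls to the tracking server fail, a quick reachability check can help separate network problems from MLflow problems. The sketch below is an optional addition, not part of the original notebook. It assumes the server exposes a `/health` endpoint that answers with HTTP 200, and `tracking_server_is_up` is a hypothetical helper name.

```python
import urllib.request


def tracking_server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers its health endpoint with HTTP 200."""
    # Assumption: the tracking server serves a health check at /health
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses
        return False


print(tracking_server_is_up("http://127.0.0.1:8080"))
```

A `False` result suggests the server is not running or is listening on a different host or port than the tracking URI set above.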

### Logging the Model with MLflow

Learn how to log the Whisper model and its configurations with MLflow.

<details>
    <summary style="cursor: pointer; display: flex; align-items: center;">
        <span style="margin-right: 10px;">&#x25BA;</span>
        <span>Expand to learn about the process of logging models in MLflow.</span>
    </summary>
    <br/>
    <div>
        <p>Logging the Whisper model in MLflow is a critical step for capturing essential information for model reproduction, sharing, and deployment. This process involves:</p>
    </div>
    <div>
        <h4>Key Components of Model Logging</h4>
        <ul>
            <li><strong>Model Information</strong>: Includes the model, its signature, and an input example.</li>
            <li><strong>Model Configuration</strong>: Any specific parameters set for the model, like <i>chunk length</i> or <i>stride length</i>.</li>
        </ul>
    </div>
    <div>
        <h4>Using MLflow's <code>log_model</code> Function</h4>
        <p>This function is utilized within an MLflow run to log the model and its configurations. It ensures that all necessary components for model usage are recorded.</p>
    </div>
    <div>
        <p>Executing the code in the next cell will log the Whisper model in the current MLflow experiment. This includes storing the model in a specified artifact path and documenting the default configurations that will be applied during inference.</p>
    </div>
</details>
<br/>


In [6]:
# Log the pipeline
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=audio_transcription_pipeline,
        artifact_path="whisper_transcriber",
        signature=signature,
        input_example=audio,
        model_config=model_config,
    )



### Loading and Using the Model Pipeline

Explore how to load and use the Whisper model pipeline from MLflow.

<details>
    <summary style="cursor: pointer; display: flex; align-items: center;">
        <span style="margin-right: 10px;">&#x25BA;</span>
        <span>Expand to learn about loading and using the logged model pipeline.</span>
    </summary>
    <br/>
    <div>
        <p>After logging the Whisper model in MLflow, the next crucial step is to load and use it for inference. This process ensures that our logged model operates as intended and can be effectively used for tasks like audio transcription.</p>
    </div>
    <div>
        <h4>Loading the Model</h4>
        <p>The model is loaded in its native format using MLflow's <code>load_model</code> function. This step verifies that the model can be retrieved and used seamlessly after being logged in MLflow.</p>
    </div>
    <div>
        <h4>Using the Loaded Model</h4>
        <p>Once loaded, the model is ready for inference. We demonstrate this by passing an MP3 audio file to the model and obtaining its transcription. This test is a practical demonstration of the model's capabilities post-logging.</p>
    </div>
    <div>
        <p>This step is a form of validation before moving to more complex deployment scenarios. Ensuring that the model functions correctly in its native format helps in troubleshooting and streamlines the deployment process, especially for large and complex models like Whisper.</p>
    </div>
</details>
<br/>


In [7]:
# Load the pipeline in its native format
loaded_transcriber = mlflow.transformers.load_model(model_uri=model_info.model_uri)

transcription = loaded_transcriber(audio)

print(f"\nWhisper native output transcription:\n{format_transcription(transcription['text'])}")

Downloading artifacts:   0%|          | 0/29 [00:00<?, ?it/s]

2023/11/08 22:53:55 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
2023/11/08 22:54:33 INFO mlflow.transformers: 'runs:/a77b9f3c037948228dd24787e33f91b4/whisper_transcriber' resolved as 'mlflow-artifacts:/224567153691341564/a77b9f3c037948228dd24787e33f91b4/artifacts/whisper_transcriber'


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/13 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Whisper native output transcription:
We choose to go to the moon in this decade and do the other things.
Not because they are easy, but because they are hard.
3, 2, 1, 0.
All engines running.
Liftoff.
We have a liftoff.
32 minutes past the hour.
Liftoff on Apollo 11.
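
Comparing this output with the pre-save transcription by eye works for a short clip, but the check can also be made programmatic. The sketch below is one possible approach using only the standard library; `transcription_similarity` is a hypothetical helper, and ignoring case and whitespace differences is an assumption about what counts as "the same" transcription.

```python
import difflib


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def transcription_similarity(before: str, after: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two transcriptions."""
    matcher = difflib.SequenceMatcher(None, normalize(before), normalize(after))
    return matcher.ratio()


# Illustrative strings; in the notebook these would be the two "text" outputs
print(transcription_similarity("We have a liftoff.", "  we have a   LIFTOFF. "))
```

In the notebook, the text returned by the original pipeline would need to be kept in its own variable before `transcription` is reassigned, so that both strings are available for the comparison.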


### Using the Pyfunc Flavor for Inference

Learn how MLflow's `pyfunc` flavor facilitates flexible model deployment.

<details>
    <summary style="cursor: pointer; display: flex; align-items: center;">
        <span style="margin-right: 10px;">&#x25BA;</span>
        <span>Expand to discover the use of <code>pyfunc</code> for model inference.</span>
    </summary>
    <br/>
    <div>
        <p>MLflow's <code>pyfunc</code> flavor provides a generic interface for model inference, offering flexibility across various machine learning frameworks and deployment environments. This feature is beneficial for deploying models where the original framework may not be available, or a more adaptable interface is required.</p>
    </div>
    <div>
        <h4>Loading and Predicting with Pyfunc</h4>
        <p>The code below illustrates how to load the Whisper model as a <code>pyfunc</code> and use it for prediction. This method highlights MLflow's capability to adapt and deploy models in diverse scenarios.</p>
    </div>
    <div>
        <h4>Output Format Considerations</h4>
        <p>Note the difference in the output format when using <code>pyfunc</code> compared to the native format. The <code>pyfunc</code> output conforms to standard pyfunc output signatures, typically represented as a <code>List[str]</code> type, aligning with broader MLflow standards for model outputs.</p>
    </div>
</details>
<br/>


In [8]:
pyfunc_transcriber = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)

pyfunc_transcription = pyfunc_transcriber.predict([audio])

# Note: if `return_timestamps` is set, the pyfunc output is a JSON-encoded string.
print(f"\nPyfunc output transcription:\n{format_transcription(pyfunc_transcription[0])}")

Downloading artifacts:   0%|          | 0/29 [00:00<?, ?it/s]

2023/11/08 22:54:45 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false


Loading checkpoint shards:   0%|          | 0/13 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Pyfunc output transcription:
We choose to go to the moon in this decade and do the other things.
Not because they are easy, but because they are hard.
3, 2, 1, 0.
All engines running.
Liftoff.
We have a liftoff.
32 minutes past the hour.
Liftoff on Apollo 11.
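
As the comment in the cell above notes, enabling `return_timestamps` changes the pyfunc output to a JSON-encoded string. The sketch below shows one way such a string might be decoded and printed. It assumes the decoded payload mirrors the Transformers pipeline output, with a `text` field and a `chunks` list whose entries carry a `timestamp` pair. The helper name `format_timestamped_chunks` is hypothetical, and the sample payload is made up purely for illustration.

```python
import json


def format_timestamped_chunks(payload: str) -> str:
    """Decode a JSON transcription payload and render one line per chunk."""
    decoded = json.loads(payload)
    lines = []
    for chunk in decoded.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end timestamp may be missing (None) if the audio is cut off mid-segment
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f}s - {end_label}s] {chunk['text'].strip()}")
    return "\n".join(lines)


# Made-up payload for illustration only; this is not real model output
sample = json.dumps(
    {
        "text": " Liftoff. We have a liftoff.",
        "chunks": [
            {"timestamp": [0.0, 1.5], "text": " Liftoff."},
            {"timestamp": [1.5, 3.0], "text": " We have a liftoff."},
        ],
    }
)
print(format_timestamped_chunks(sample))
```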


### Tutorial Roundup

Throughout this tutorial, we've explored how to:

- Set up an audio transcription pipeline using the OpenAI Whisper model.
- Acquire audio data for transcription.
- Log, load, and use the model with MLflow, leveraging both the native and pyfunc flavors for inference.
- Format the output for readability and practical use in a Jupyter Notebook environment.

We've seen the benefits of using MLflow for managing the machine learning lifecycle, including experiment tracking, model versioning, reproducibility, and deployment. By integrating MLflow with the Transformers library, we've streamlined the process of working with state-of-the-art speech recognition models, making it easier to track, manage, and deploy ASR applications.