# Using Hugging Face DLC to Host the Whisper Model for Automatic Speech Recognition Tasks


❗This notebook works well with the `PyTorch 2.0.0 Python 3.10 CPU Optimized` kernel on a SageMaker Studio `ml.t3.medium` instance.

## Setup

Before running the notebook, install the required packages.

In [None]:
%%sh
pip install -Uq pip
pip install -Uq "sagemaker>=2.221.1"
pip install -Uq datasets==2.16.1
pip install -Uq soundfile==0.12.1
pip install -Uq librosa==0.10.2.post1

In [None]:
!pip freeze | grep -E "datasets|librosa|sagemaker|soundfile|torch"

datasets==2.16.1
librosa==0.10.2.post1
sagemaker==2.221.1
sagemaker-experiments==0.1.43
sagemaker-pytorch-training==2.8.0
sagemaker-training==4.5.0
smdebug @ file:///tmp/sagemaker-debugger
soundfile==0.12.1
torch==2.0.0
torchaudio==2.0.1
torchdata @ file:///opt/conda/conda-bld/torchdata_1679615656247/work
torchtext==0.15.1
torchvision==0.15.1


## Download a test data sample from a Hugging Face dataset

In [None]:
import soundfile as sf
from datasets import load_dataset

dataset = load_dataset('MLCommons/peoples_speech', split='train', streaming=True, trust_remote_code=True)
sample = next(iter(dataset))

In [None]:
audio_data = sample['audio']['array']
audio_path = 'sample_audio.wav'
sf.write(audio_path, audio_data, sample['audio']['sampling_rate'])

print(f"Audio sample saved to '{audio_path}'")
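
Before sending the file to the endpoint, it can help to confirm what was actually written. The following is a minimal sketch using only the standard-library `wave` module. `describe_wav` is a hypothetical helper, not part of the original notebook, and it assumes `soundfile` wrote the file as uncompressed PCM WAV, which `wave` can read.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic header fields from a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sampling_rate": rate,
            "duration_seconds": frames / rate,
        }
```

For example, `describe_wav(audio_path)` should report the same sampling rate as `sample['audio']['sampling_rate']`.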

In [None]:
import boto3
import sagemaker

aws_region = boto3.Session().region_name
sess = sagemaker.session.Session()
role = sagemaker.get_execution_role()

bucket = sess.default_bucket()
prefix = 'openai-whisper'

## Real-time Inference

In [None]:
import boto3
from typing import Dict


def get_cfn_outputs(stackname: str, region_name: str) -> Dict[str, str]:
    # Return the stack's outputs as an {OutputKey: OutputValue} mapping
    cfn = boto3.client('cloudformation', region_name=region_name)
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs
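
`get_cfn_outputs` indexes `['Outputs']` directly, so it would raise a `KeyError` if the stack description carries no `Outputs` key, which may happen when a stack declares no outputs. A minimal sketch of a more defensive variant follows. `flatten_stack_outputs` is a hypothetical helper that takes an already-fetched response, so it can be tried without AWS credentials, and `sample_response` is made-up illustrative data, not output from the real stack.

```python
from typing import Any, Dict


def flatten_stack_outputs(response: Dict[str, Any]) -> Dict[str, str]:
    """Turn a describe_stacks-style response into {OutputKey: OutputValue}."""
    stacks = response.get("Stacks", [])
    if not stacks:
        return {}
    # "Outputs" may be absent when the stack declares no outputs
    return {o["OutputKey"]: o["OutputValue"] for o in stacks[0].get("Outputs", [])}


# Illustrative response shape (made-up values)
sample_response = {
    "Stacks": [
        {
            "StackName": "ASRHuggingFaceRealtimeEndpointStack",
            "Outputs": [
                {"OutputKey": "EndpointName", "OutputValue": "example-endpoint"},
            ],
        }
    ]
}
print(flatten_stack_outputs(sample_response))
```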

In [None]:
CFN_STACK_NAME = "ASRHuggingFaceRealtimeEndpointStack"
cfn_stack_outputs = get_cfn_outputs(CFN_STACK_NAME, aws_region)

endpoint_name = cfn_stack_outputs['EndpointName']

In [None]:
from sagemaker import Predictor
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer


audio_serializer = DataSerializer(content_type="audio/x-audio")
deserializer = JSONDeserializer()

predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=audio_serializer,
    deserializer=deserializer
)

In [None]:
%%time
# Perform real-time inference

try:
    response = predictor.predict(data=audio_path)
    print(response)
except Exception as ex:
    print(ex)

{'text': " I wanted to share a few things, but I'm going to not share as much as I wanted to share because we are starting late. I'd like to get this thing going so we can all get home at a decent hour. This election is very important to us."}


In [None]:
# Perform real-time inference

try:
    initial_args = {'ContentType': 'audio/wav'}
    response = predictor.predict(
        initial_args=initial_args,
        data=audio_path
    )

    print(response)
except Exception as ex:
    print(ex)

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Content type audio/wav is not supported by this framework.\n\n            Please implement input_fn to to deserialize the request data or an output_fn to\n            serialize the response. For more information, see the SageMaker Python SDK README."
}
". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/openai-whisper-medium-9105237 in account 123456789012 for more information.


In [None]:
%%time

# Perform real-time inference

initial_args = {'ContentType': 'audio/wave'}
response = predictor.predict(
    initial_args=initial_args,
    data=audio_path
)

response

CPU times: user 8.43 ms, sys: 0 ns, total: 8.43 ms
Wall time: 1.26 s


{'text': " I wanted to share a few things, but I'm going to not share as much as I wanted to share because we are starting late. I'd like to get this thing going so we can all get home at a decent hour. This election is very important to us."}
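
In the runs above, the endpoint accepted `audio/x-audio` and `audio/wave` but rejected `audio/wav`. A small guard can catch the rejected spelling before the request is sent. This is a sketch based only on those three observations: `OBSERVED_ACCEPTED`, `OBSERVED_ALIASES`, and `normalize_content_type` are hypothetical names, and the sets are not a complete list of what the serving container supports.

```python
# Content types seen to work against this endpoint in the notebook runs
OBSERVED_ACCEPTED = {"audio/x-audio", "audio/wave"}
# Spellings seen to fail, mapped to a spelling seen to work
OBSERVED_ALIASES = {"audio/wav": "audio/wave"}


def normalize_content_type(content_type: str) -> str:
    """Return a content type observed to work, or raise ValueError."""
    candidate = content_type.strip().lower()
    candidate = OBSERVED_ALIASES.get(candidate, candidate)
    if candidate not in OBSERVED_ACCEPTED:
        raise ValueError(
            f"Content type not verified against this endpoint: {content_type!r}"
        )
    return candidate
```

With this helper, `initial_args = {'ContentType': normalize_content_type('audio/wav')}` would send `audio/wave` instead of the rejected value.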

In [None]:
%%time

with open(audio_path, "rb") as file:
    wav_file_read = file.read()

response = predictor.predict(
    data=wav_file_read
)

response

CPU times: user 8.4 ms, sys: 233 µs, total: 8.63 ms
Wall time: 1.3 s


{'text': " I wanted to share a few things, but I'm going to not share as much as I wanted to share because we are starting late. I'd like to get this thing going so we can all get home at a decent hour. This election is very important to us."}
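
Passing a file path and passing the file's bytes return the same transcript because the serializer ends up sending the same payload in both cases. The sketch below is a simplified stand-in for that behavior, not the SDK's actual code, and `to_payload` is a hypothetical name.

```python
from typing import Union


def to_payload(data: Union[str, bytes]) -> bytes:
    """Simplified stand-in: a str is read as a file path, bytes pass through."""
    if isinstance(data, bytes):
        return data
    if isinstance(data, str):
        with open(data, "rb") as f:
            return f.read()
    raise TypeError(f"Unsupported input type: {type(data).__name__}")
```

Under this model, `to_payload(audio_path)` and `to_payload(wav_file_read)` produce identical bytes, which matches the identical responses shown above.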

## References

- [(AWS Blog) Host the Whisper Model on Amazon SageMaker: exploring inference options (2024-01-16)](https://aws.amazon.com/blogs/machine-learning/host-the-whisper-model-on-amazon-sagemaker-exploring-inference-options/)
- [(Example Jupyter Notebooks) Using Huggingface DLC to Host the Whisper Model for Automatic Speech Recognition Tasks](https://github.com/aws-samples/amazon-sagemaker-host-and-inference-whisper-model/blob/main/huggingface/huggingface.ipynb)