On-device Whisper inference on mobile (iPhone 13 Mini) #407
-
This is really nice. Do you have plans to run it on macOS?
-
@ggerganov I used your whisper.cpp to create an iOS SwiftUI transcription app - it's in the App Store:
-
Hey, curious how this project is going? What's the likelihood of getting something like this to provide real-time translation services on an iPhone? I work in a hospital with a large immigrant population and the translation services we use are so painful to use. An app like that could be extremely lucrative, as it would save physicians and nurses so much time each day...
-
About using GPT-3 for translation to languages other than English: I don't
agree with a paid subscription; I would rather pay for the app itself and
add my OpenAI API key to use their server directly, not some other server
in the middle.
On Fri, 11 Nov 2022 at 10:15, Georgi Gerganov wrote:
@bjnortier <https://github.com/bjnortier>
AFAIK in the medical industry, privacy of protected health information
(PHI) is of great importance, so I believe you cannot simply use an
application that uploads PHI somewhere in the Cloud. Or at least, it will
be difficult to make it comply with regulations. Therefore, I imagine it
would be useful to have a local translation solution that does not require
an internet connection.
Additionally, my experience with Whisper is that the transcription
accuracy is really high - I won't be surprised if it is actually the
highest among the generally available ASR frameworks (I have no experience
with other such software, so I could be wrong). If the translation accuracy
is at the level of the transcription, then a translation application for a
mobile device could actually be useful. I've only played with translating
Bulgarian and the quality is not that great, but maybe other languages fare
better.
@geraldmd <https://github.com/geraldmd>
I will provide a proof-of-concept similar to the above for real-time
translation on iPhone. I think it can easily run the base model and with
some extra work - also the small model. But I don't have plans to make a
full-blown app with all the necessary features. Maybe others would be
interested in turning it into a real app.
-
@ggerganov This is really cool, awesome job!! Question: do you know the improvement in inference time vs. the original Python implementation? I'm somewhat surprised that the difference is so extreme, as the PyTorch library is written in C/C++. Wouldn't we expect most of the heavy lifting to already be done by C/C++?
-
Hey again,
I recently posted here about my reimplementation of the model in C/C++, and yesterday I even got it running on mobile, so I thought that people would be interested in that as well. First, here is a short video demonstration:
whisper-iphone-13-mini-2.mp4
This demo runs the `base.en` model, without internet connection on the device. I was pretty happy with the performance and honestly - quite surprised. Running the `encoder` part of the transformer on a single 30-second audio chunk takes about 1 second for the `base` model and about 3 seconds for the `small` model. The `decoder` part of course depends on the actual audio, but it is typically faster than the `encoder` when using `Greedy` sampling.

The model implementation is in pure C/C++, wrapped in a C-style API and called from the Objective-C application. I utilize NEON instructions + the `Accelerate` framework for efficiency. The implementation [0] and this sample app [1] are open-source and available in the `whisper.cpp` repo.

I think this kind of performance allows for some real-world mobile applications using Whisper - at least if you can afford to put the model data in your app. So far I have tested this only on an iPhone 13 Mini, so it would be interesting if someone gives this a try on other iPhones and reports some results.
Also, wanted to say again that this Whisper model is very interesting to me and you guys at OpenAI have done a great job. Reimplementing this during the past few weeks was a very fun project and I learned quite a lot of stuff about transformers and linear algebra optimisations.
Thanks again!
[0] - https://github.com/ggerganov/whisper.cpp
[1] - https://github.com/ggerganov/whisper.cpp/tree/master/examples/whisper.objc