joshuaboniface/remote-faster-whisper

Remote Faster Whisper API

Remote Faster Whisper is a basic API designed to perform transcriptions of audio data with Faster Whisper over the network.

Our reference consumer is Kalliope, a Python virtual assistant tool. Normally, Kalliope would run on a low-power, low-cost device such as a Raspberry Pi. While Faster Whisper can run on such a device, it can take a prohibitively long time to process speech into text, especially on older or non-overclocked devices or when you need better accuracy than the tiny model provides. Remote Faster Whisper exists to offload this processing onto a much faster machine, ideally one with a CUDA-supporting GPU, so that the audio is transcribed and the text returned in a reasonable time. This also lets a small collection of such devices share a single central transcription server instead of each spending a lot of power individually, while still keeping the STT self-hosted on-network. An example STT plugin for Kalliope is provided in the Kalliope folder.

Installation & Usage

To install Remote Faster Whisper, clone this repository to your system and run setup.sh as root (e.g. sudo ./setup.sh). You will be prompted for several configuration details, including the path to install to, whether to install a service unit, and what user to run it as (for service deploys only). The script then installs Remote Faster Whisper and the dependencies from requirements.txt inside a virtualenv in the specified path, installs the systemd unit file into /etc/systemd/system (if chosen), and finally prompts you to edit the configuration file and start/enable the service. You can also perform these steps manually if you so choose.

Once running, you can HTTP POST binary WAV audio file data to the /api/v0/transcribe endpoint, and receive a JSON response of the transcription text and details. A simple test client is provided as send.py to validate a running instance with a local wav file.

Note: You must POST the audio data as files with the name audio_file as shown in the test client or the Kalliope STT example, and the data must be valid PCM WAV data (not FLAC, mp3, or any other formats).
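As a rough illustration of the note above, the following sketch builds a small PCM WAV file in memory and wraps it in a multipart POST whose file field is named audio_file. It uses only the Python standard library and is not the bundled send.py. The helper names, the 127.0.0.1 host, and the sample tone are assumptions made for the example; the port and path are the defaults described in this document.

```python
import io
import math
import struct
import urllib.request
import uuid
import wave


def make_pcm_wav(seconds=0.5, rate=16000, freq=440.0):
    """Return a mono 16-bit PCM WAV file (a sine tone) as bytes."""
    n_frames = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buf.getvalue()


def build_transcribe_request(url, wav_bytes, filename="sample.wav"):
    """Build a multipart/form-data POST carrying the WAV as 'audio_file'."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="audio_file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: audio/wav\r\n\r\n",
        wav_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Host is an assumption; port and path are the documented defaults.
request = build_transcribe_request(
    "http://127.0.0.1:9876/api/v0/transcribe", make_pcm_wav()
)
# To send it to a running instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

The request is only constructed here, not sent, so the sketch can be inspected without a running server.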

The JSON response will look something like:

{"language": "en", "language_probability": 0.9578803181648254, "runtime": 0.30777573585510254, "sample_duration": 1.7763125, "text": "Hello world"}
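A consumer might handle such a response along these lines. This is a minimal sketch that assumes the server returns standard JSON with the field names shown above, and that runtime and sample_duration are both in seconds (the document does not state the units). The summarize helper is hypothetical.

```python
import json

# Example payload using the field names and values from the sample response.
raw = (
    '{"language": "en", "language_probability": 0.9578803181648254, '
    '"runtime": 0.30777573585510254, "sample_duration": 1.7763125, '
    '"text": "Hello world"}'
)


def summarize(response_text):
    """Return (text, language, real-time factor) from a response body."""
    result = json.loads(response_text)
    # Real-time factor: processing time divided by audio length.
    # Values below 1 mean transcription ran faster than real time.
    rtf = result["runtime"] / result["sample_duration"]
    return result["text"], result["language"], rtf


text, language, rtf = summarize(raw)
print(f"{text!r} ({language}), real-time factor {rtf:.2f}")
```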

Remote Faster Whisper is currently very sparse. It is not a real Python module or package, it runs as a Flask development server, and it uses the faster_whisper library directly (rather than a wrapper such as SpeechRecognition, though it does leverage some of that library's helper functions). These deficiencies may change in the future, and contributions are welcome.

Configuration Options

The configuration file config.yaml is divided into three main sections: daemon, which controls the Flask API daemon itself; faster_whisper, which controls the Faster Whisper transcription library; and transformations, which defines transformations to apply to the output text.

daemon -> listen

The IP address to listen on. Use 0.0.0.0 to listen on all interfaces.

daemon -> port

The port to listen on. We default to 9876, but this can be changed as desired to any unprivileged (1024 or higher) port number.

daemon -> base_url

The base URL for the API. This defaults to /api/v0 but this can be changed to anything or an empty value if desired.
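To show how the three daemon options combine from a client's point of view, here is a small sketch. It assumes the endpoint is always the base URL followed by /transcribe, as the default /api/v0/transcribe suggests, and that an empty base URL leaves just /transcribe; neither is confirmed beyond the default case. The transcribe_url helper and the 192.0.2.10 address are hypothetical.

```python
def transcribe_url(host, port=9876, base_url="/api/v0"):
    """Join host, port, and base URL into the transcribe endpoint URL."""
    prefix = base_url.strip("/")
    path = f"/{prefix}/transcribe" if prefix else "/transcribe"
    return f"http://{host}:{port}{path}"


# Clients connect to the server's reachable address, not to 0.0.0.0.
print(transcribe_url("192.0.2.10"))
print(transcribe_url("192.0.2.10", port=8080, base_url=""))
```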

faster_whisper -> model_cache_dir

The directory to cache Faster Whisper models. We recommend a RAM disk (tmpfs) for this to improve performance, though any path can be used.

Remote Faster Whisper will attempt to download the model below at startup if this path is not found; this may take some time on slow network connections. Downloading at startup, rather than during the first transcription, improves the user experience. If the directory exists but the model is missing, the model will instead be downloaded when the first transcription occurs.

Note: When using a service install with a dynamic user (the default if no user is specified), this option must be set to a temporary directory (under /tmp or /var/tmp), and note that the model will be cached to an ephemeral directory valid only for the time the service is active. Thus the model will be re-downloaded each time the daemon starts. To avoid this, use a real user for the daemon, or use a pre-configured cache containing the model you wish to use outside of these temporary paths.

faster_whisper -> model

The model to use for transcribing. Can be any valid model that Faster Whisper supports.

faster_whisper -> device

The device to use for transcription processing. Can be one of auto, cpu, or cuda. Note that CUDA requires NVIDIA libraries to operate correctly; these should be installed by torch on supported systems by default.

faster_whisper -> device_index

The device index to use. Mostly relevant for cuda device support, to specify the GPU to use.

faster_whisper -> compute_type

The compute type to use; see the CTranslate2 documentation for details.

faster_whisper -> beam_size

The beam size for the transcriber to use. You should not ever need to change this unless you know why you need to.

faster_whisper -> translate

Whether to attempt translation of the incoming audio into language (below). If false (no), the audio is always assumed to already be in that language. Leave this as no if you plan to use an English-only .en model.

faster_whisper -> language

The language to use, as a lowercase ISO language code (e.g. en, fr, zh, etc.). Leave empty (or remove) for automatic language selection.
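The options above line up closely with parameters of the faster_whisper library. The sketch below shows one plausible mapping onto the keyword arguments of WhisperModel(...) and WhisperModel.transcribe(...). It is an assumption about how such a daemon could pass the values through, not a description of this project's actual code. The helper names and the example values are hypothetical, and translate is left out because its exact handling is not described here.

```python
def model_kwargs(cfg):
    """Keyword arguments for faster_whisper.WhisperModel(...)."""
    return {
        "model_size_or_path": cfg["model"],
        "device": cfg.get("device", "auto"),
        "device_index": cfg.get("device_index", 0),
        "compute_type": cfg.get("compute_type", "default"),
        "download_root": cfg.get("model_cache_dir"),
    }


def transcribe_kwargs(cfg):
    """Keyword arguments for WhisperModel.transcribe(...)."""
    return {
        "beam_size": cfg.get("beam_size", 5),
        # An empty or missing language means automatic detection (None).
        "language": cfg.get("language") or None,
    }


# Hypothetical values standing in for the faster_whisper section.
example_cfg = {
    "model_cache_dir": "/tmp/whisper-cache",
    "model": "base",
    "device": "auto",
    "device_index": 0,
    "compute_type": "int8",
    "beam_size": 5,
    "language": "",
}

print(model_kwargs(example_cfg))
print(transcribe_kwargs(example_cfg))

# With faster_whisper installed, the values could be used like this:
# from faster_whisper import WhisperModel
# model = WhisperModel(**model_kwargs(example_cfg))
# segments, info = model.transcribe("sample.wav", **transcribe_kwargs(example_cfg))
```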

transformations

This section is a list of tuple-lists, where the first element is a re.sub matching regex, and the second element is the replacement; e.g.

transformations:
  - ["(bunny|hare)", "rabbit"]
  - ["[Cc]ute", "fancy"]

After transcribing text with Faster Whisper, the text is run through these transformations in order, replacing the regex, if found in the text, with the corresponding string. Transformations build on each other, so a later transformation can alter the result of an earlier one.

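For example, with the above transformations, speaking "the cute bunny" will return "the fancy rabbit", and speaking "The Cute hare" will return "The fancy rabbit" (the leading capital "T" is untouched because no transformation matches it).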
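The ordered, cumulative behaviour can be reproduced with plain re.sub calls. This is a minimal sketch of the idea, with the rules written as Python lists mirroring the configuration above; the apply_transformations helper is hypothetical and not taken from the project.

```python
import re

# Same rules as the example configuration, as [pattern, replacement] pairs.
transformations = [
    ["(bunny|hare)", "rabbit"],
    ["[Cc]ute", "fancy"],
]


def apply_transformations(text, rules):
    """Apply each regex rule in order; later rules see earlier results."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_transformations("the cute bunny", transformations))  # the fancy rabbit
print(apply_transformations("The Cute hare", transformations))   # The fancy rabbit
```

Note that neither rule touches the leading "The", so its capitalization is preserved unless a case transformation is also configured.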

This is a contrived example; the real reason to use transformations is to "fix up" common mishearings or misunderstandings in your environment.

As a more concrete example, you may say the phrase "lights on", but in your voice this is parsed as "light is on" or "light's on". As long as you don't expect "is on" to mean anything to your consumer, you could use a transformation here to force the "right" text, like:

transformations:
  - ["light is on", "lights on"]

This will ensure that the consumer gets something it expects even if the Whisper models don't quite understand you.

You could also generalize this a bit more and leverage the whitespace to your advantage:

transformations:
  - [" is on", "s on"]

This would replace both "light is on" with "lights on" as well as "speaker is on" with "speakers on", if both are common mishearings.
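A generalized rule like this applies to every phrase containing " is on", not only the ones you had in mind. The short sketch below, using an illustrative rewrite helper and made-up phrases, shows both the intended matches and an unintended one.

```python
import re

pattern, replacement = " is on", "s on"


def rewrite(heard):
    """Apply the generalized ' is on' rule to one transcribed phrase."""
    return re.sub(pattern, replacement, heard)


for heard in ["light is on", "speaker is on", "what is on the calendar"]:
    print(f"{heard!r} -> {rewrite(heard)!r}")
```

The third phrase becomes "whats on the calendar", which is harmless only if your consumer never needs the original wording.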

There are also four special transformations that can be used. These should be entered as simple list entries rather than as tuple-lists.

  • lower will convert the entire string to lowercase with str.lower().

  • casefold will convert the entire string to full lowercase with str.casefold().

  • upper will convert the entire string to uppercase with str.upper().

  • title will convert the entire string to title-case with str.title().

Note: These special transformations are always applied first, before any other transformations, in the order given above. Using multiple special transformations is likely not very useful, but be mindful of this if you do.

Thus a full transformations example might look like:

transformations:
  - lower
  - ["[\\.,!?]", ""]  # Note the double-backslash for a literal '.'
  - [" is on", "s on"]
  - ["(keeter|peter)", "heater"]

This will ensure a fully-lowercase result, with no (common) punctuation, " is on" replaced by "s on", and "keeter" or "peter" replaced with "heater"; hence speaking something that is transcribed as "Keeter is on." will return "heaters on".
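The full example can be traced with a small sketch that applies the special transformations first, in the fixed order lower, casefold, upper, title, and then the regex rules in their listed order. This follows the behaviour described in this section but is not the project's own implementation; run_pipeline and SPECIALS are hypothetical names.

```python
import re

# Special transformations in the fixed order they are described as applied.
SPECIALS = {
    "lower": str.lower,
    "casefold": str.casefold,
    "upper": str.upper,
    "title": str.title,
}


def run_pipeline(text, transformations):
    """Apply special transformations first, then regex rules in order."""
    for name, func in SPECIALS.items():
        if name in transformations:
            text = func(text)
    for entry in transformations:
        if isinstance(entry, (list, tuple)):
            pattern, replacement = entry
            text = re.sub(pattern, replacement, text)
    return text


config = [
    "lower",
    ["[\\.,!?]", ""],
    [" is on", "s on"],
    ["(keeter|peter)", "heater"],
]

print(run_pipeline("Keeter is on.", config))  # heaters on
```

Because the special entries run first regardless of position, placing lower at the end of the list would give the same result in this sketch.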

Note: You should use this feature sparingly. A large number of transformations might slow down your transcription time considerably, and you must be mindful of the implications each transformation will have on all possible texts that are parsed. They work best with only a few common mishearings and when using relatively short text strings, for example in a voice command system.

Note: Regexes in the first field are normal strings, i.e. they are not treated as raw strings. Be mindful of complex regexes.
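The same pitfall exists with ordinary (non-raw) Python string literals, which makes for a convenient illustration. The sketch below is a Python analogy for what happens to backslashes in a double-quoted config string, not a demonstration of the project's config loader.

```python
import re

plain = "[\\.,!?]"  # normal string: the doubled backslash yields one backslash
raw = r"[\.,!?]"    # raw string spelling of the same regex
print(plain == raw)  # True
print(re.sub(plain, "", "Lights on, please!"))  # Lights on please

# A single backslash can silently change meaning: "\b" in a normal string
# is a backspace character, not the regex word-boundary escape.
print(re.search("\bon\b", "lights on") is None)        # True: no match
print(re.search("\\bon\\b", "lights on") is not None)  # True: matches "on"
```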
