Ruby-based web service for speech recognition, using the PocketSphinx gstreamer module


Requirements

  • Ruby 1.8

  • Sinatra

  • Rack

  • Unicorn

  • PocketSphinx (NOTE: some features of the server require a patched PocketSphinx; see below)

  • Some acoustic and language models for PocketSphinx


CMU Sphinx

  • Install sphinxbase from SVN (make, make install)

Apply PocketSphinx patch

In the cmusphinx/pocketsphinx directory:

patch -p0 -i ps_gst.patch

Make sure you have the GStreamer development packages installed. On Debian Squeeze:

apt-get install libgstreamer0.10-dev libgstreamer-plugins-base0.10-dev

Then run configure, make, and make install as usual.

Install Ruby gems: Unicorn, Sinatra, UUID tools, JSON

This assumes you have Ruby and RubyGems installed.

You might want to do this as root:

gem install unicorn
gem install sinatra
gem install uuidtools
gem install json

Install the ruby-gstreamer package (the package name might vary depending on your distribution):

apt-get install libgst-ruby1.8

Run ruby-pocketsphinx-server

Clone the git repository:

git clone git://

Before executing, add `/usr/local/lib` to the path where GStreamer plugins are looked for:

export GST_PLUGIN_PATH=/usr/local/lib


Start the server:

unicorn -c unicorn.conf.rb

If you installed Unicorn as a Ruby gem, you might need to execute:

/var/lib/gems/1.8/bin/unicorn -c unicorn.conf.rb

Test the default configuration (English “turtle” LM), using a raw audio file in the PocketSphinx test directory.

curl -T "${POCKETSPHINX_DIR}/test/data/goforward.raw" -H "Content-Type: audio/x-raw-int; rate=16000" "http://localhost:8080/recognize"

The response should be:

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "go forward ten meters"
    }
  ],
  "id": "15c7a538d0d0c8d7f59e3cc791320953"
}

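Client code usually needs only the recognized text. The following Ruby sketch pulls the first hypothesis out of a response like the one above. The helper name `best_utterance` is hypothetical, and treating any non-zero status as a failure is an assumption, since this README only shows responses with status 0. It also assumes a modern Ruby rather than the 1.8 the server targets.

```ruby
require "json"

# Sample body copied from the expected response above.
body = <<~JSON
  {
    "status": 0,
    "hypotheses": [
      { "utterance": "go forward ten meters" }
    ],
    "id": "15c7a538d0d0c8d7f59e3cc791320953"
  }
JSON

# Return the first hypothesis as a plain string, or nil when there is none.
# ASSUMPTION: a non-zero "status" means recognition failed.
# The "utterance" field appears in this README both as a string and as a
# one-element array, so both shapes are handled.
def best_utterance(body)
  result = JSON.parse(body)
  return nil unless result["status"] == 0

  first = result.fetch("hypotheses", []).first
  return nil if first.nil?

  Array(first["utterance"]).first
end

puts best_utterance(body)  # prints: go forward ten meters
```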

Web service

Unicorn configuration is in the file unicorn.conf.rb. See the Unicorn documentation for more info.


See conf.yaml

Using the web service

Some of the more advanced examples below are specific to the Estonian configuration.

Example 1

Record a sentence to a WAV file, in mono (hit Ctrl-C when done speaking):

rec -c 1 sentence.wav

Send it to the web service:

curl -X POST --data-binary @sentence.wav -H "Content-Type: audio/x-wav" http://localhost:8080/recognize

Output (encoded as JSON; the example uses Estonian models):

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": [
        "t\u00e4na on v\u00e4ljas \u00fcsna ilus ilm"
      ]
    }
  ],
  "id": "e30f54561135d681599915562d77d240"
}

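The same POST can be made from Ruby instead of curl. This is a minimal sketch using Net::HTTP from the standard library. The helper names `build_wav_request` and `recognize_wav` are hypothetical, and the code assumes a modern Ruby rather than the 1.8 the server targets.

```ruby
require "net/http"
require "uri"

# Build (but do not send) a POST request carrying WAV audio,
# mirroring the curl call in Example 1.
def build_wav_request(url, audio_bytes)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri)
  request["Content-Type"] = "audio/x-wav"
  request.body = audio_bytes
  [uri, request]
end

# Send the file at `path` to the service and return the raw JSON body.
def recognize_wav(url, path)
  uri, request = build_wav_request(url, File.binread(path))
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request).body }
end

# Usage (needs a running server and a recorded sentence.wav):
#   puts recognize_wav("http://localhost:8080/recognize", "sentence.wav")
```

Keeping request construction separate from sending makes it possible to inspect the headers without a running server.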
Example 2

Record a raw file using arecord:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 > sentence2.raw

Send it to the web service:

curl -X POST --data-binary @sentence2.raw -H "Content-Type: audio/x-raw-int; rate=16000" http://localhost:8080/recognize
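A raw file has no header, so the sample rate has to be stated in the Content-Type and the duration can only be derived from the file size. As a quick sanity check, the arithmetic for the arecord settings above (16-bit samples, one channel, 16000 Hz) can be sketched as follows; the function name is hypothetical.

```ruby
# Duration of headerless PCM audio:
# bytes / (sample rate * bytes per sample * channels).
# Defaults match the arecord settings above (S16_LE, mono, 16000 Hz).
def raw_duration_seconds(byte_count, rate: 16_000, bytes_per_sample: 2, channels: 1)
  byte_count.to_f / (rate * bytes_per_sample * channels)
end

puts raw_duration_seconds(160_000)  # prints 5.0

# Usage with a recorded file:
#   puts raw_duration_seconds(File.size("sentence2.raw"))
```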

Example 3

Record five seconds of audio and pipe it to curl, which streams it directly to the web service using PUT (and gets an almost instant response):

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000" http://localhost:8080/recognize
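A streaming upload like this could also be sketched in Ruby. Net::HTTP can send a request body from an IO object when chunked transfer encoding is requested, which is also what curl does when uploading from standard input over HTTP/1.1. The helper name `build_streaming_put` is hypothetical, and the sketch assumes a modern Ruby.

```ruby
require "net/http"
require "uri"

# Build a PUT request whose body is read from `io` while it is being sent.
# Net::HTTP requires either Content-Length or chunked transfer encoding
# when body_stream is used; the length is unknown here, so chunked is set.
def build_streaming_put(url, io, rate = 16_000)
  uri = URI.parse(url)
  request = Net::HTTP::Put.new(uri.request_uri)
  request["Content-Type"] = "audio/x-raw-int; rate=#{rate}"
  request["Transfer-Encoding"] = "chunked"
  request.body_stream = io
  [uri, request]
end

# Usage (save as stream.rb, then: arecord ... | ruby stream.rb):
#   uri, request = build_streaming_put("http://localhost:8080/recognize", $stdin)
#   Net::HTTP.start(uri.host, uri.port) { |http| puts http.request(request).body }
```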

Support for JSGF grammars

Users can use their own grammars to recognize certain sentences. The grammars should be in JSGF format.

Example JSGF grammar (let's call it robot.jsgf):

#JSGF V1.0;

grammar robot;

public <command> = (liigu | mine ) [ ( üks | kaks | kolm | neli | viis ) meetrit ] (edasi | tagasi);
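To see which sentences this grammar accepts, the rule can be expanded by hand. The Ruby sketch below does not parse JSGF; it only restates the three groups of the `<command>` rule as arrays and enumerates their combinations, treating the bracketed group as optional.

```ruby
# The three groups of the <command> rule, transcribed by hand.
verbs      = %w[liigu mine]
numbers    = %w[üks kaks kolm neli viis]
directions = %w[edasi tagasi]

# The bracketed group is optional: either nothing, or "<number> meetrit".
distances = [nil] + numbers.map { |n| "#{n} meetrit" }

# Every combination of verb, optional distance, and direction.
sentences = verbs.product(distances, directions).map do |parts|
  parts.compact.join(" ")
end

puts sentences.length                                # prints 24
puts sentences.include?("mine viis meetrit tagasi")  # prints true
```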

NB! Grammars must use the same charset as the server's dictionary, which is currently Latin-1 (sorry for that).
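If the grammar was written in a UTF-8 editor, it has to be converted before uploading. A minimal Ruby sketch, assuming a modern Ruby with encoding support (not the 1.8 the server targets); the helper name `to_latin1` is hypothetical.

```ruby
# Convert UTF-8 text to Latin-1 (ISO-8859-1).
# String#encode raises Encoding::UndefinedConversionError when a character
# has no Latin-1 equivalent, which catches unusable grammars early.
def to_latin1(text)
  text.encode("ISO-8859-1")
end

converted = to_latin1("üks | kaks | kolm")
puts converted.encoding  # prints ISO-8859-1

# Usage: convert a UTF-8 grammar file and write the bytes unchanged.
#   utf8 = File.read("robot.utf8.jsgf", encoding: "UTF-8")
#   File.binwrite("robot.jsgf", to_latin1(utf8))
```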

You need to upload the JSGF file to somewhere where the server can fetch it, let's say

Now, let the server download and compile it:

curl -vv  http://localhost:8080/fetch-lm?url=

This should result in HTTP/1.1 200 OK.
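When the fetch request is built in code, the grammar URL should be percent-encoded so that its own `:` and `/` characters do not clash with the outer URL. The sketch below builds such a request URL in Ruby. The helper name `fetch_lm_url` and the address `http://example.com/robot.jsgf` are placeholders for illustration, not the location used by the author.

```ruby
require "uri"

# Build the /fetch-lm request URL for a grammar hosted at `grammar_url`.
def fetch_lm_url(server, grammar_url)
  uri = URI.parse(server)
  uri.path = "/fetch-lm"
  # encode_www_form percent-encodes the value of the "url" parameter.
  uri.query = URI.encode_www_form(url: grammar_url)
  uri.to_s
end

# example.com is a placeholder host, not a real grammar location.
puts fetch_lm_url("http://localhost:8080", "http://example.com/robot.jsgf")
# prints http://localhost:8080/fetch-lm?url=http%3A%2F%2Fexample.com%2Frobot.jsgf
```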

Now you can use the grammar to recognize a sentence that is accepted by the grammar:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | \
curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  http://localhost:8080/recognize?lm=

Output:

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "mine viis meetrit tagasi"
    }
  ],
  "id": "9e3895e9ee0b5138e73c6fca30f51a58"
}

If you update the grammar on the server, you need to make the /fetch-lm request again, because the server doesn't check for changes on every recognition request (for efficiency reasons).

Support for GF grammars

GF (Grammatical Framework) grammars are supported.

A GF grammar must be compiled into a .pgf file. To upload it to the server, use the fetch-pgf API call, e.g.:

curl ""

The 'lang' parameter (defaults to 'Est') specifies the input languages of the grammar. Multiple comma-separated languages can be specified, e.g. lang=Est,Est2

To recognize with a GF grammar, use a request similar to the JSGF one, e.g.:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000" "http://localhost:8080/recognize?lm="

You can also specify output language(s) that will be used to linearize the raw recognition result, e.g.:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000" "http://localhost:8080/recognize?lm="

Output:

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "viis minutit sekundites",
      "linearizations": [
        {
          "lang": "App",
          "output": "5 ' IN \""
        },
        {
          "lang": "App",
          "output": "5 min IN s"
        }
      ]
    }
  ],
  "id": "83486feaca30995401ed4a66951a3f23"
}

Multiple output languages can be requested as comma-separated values: “..&output-lang=App,App2”
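A client that requests linearizations will typically want them as (language, output) pairs. The following Ruby sketch extracts them from a response shaped like the GF example in this section. The helper name `linearizations` is hypothetical, and the sample body is copied from that example rather than fetched from a server.

```ruby
require "json"

# Sample body copied from the GF example output in this section.
# The single-quoted heredoc keeps the JSON escape \" intact.
gf_body = <<~'JSON'
  {
    "status": 0,
    "hypotheses": [
      {
        "utterance": "viis minutit sekundites",
        "linearizations": [
          { "lang": "App", "output": "5 ' IN \"" },
          { "lang": "App", "output": "5 min IN s" }
        ]
      }
    ],
    "id": "83486feaca30995401ed4a66951a3f23"
  }
JSON

# Collect [lang, output] pairs from every hypothesis.
# Hypotheses without a "linearizations" key contribute nothing.
def linearizations(body)
  JSON.parse(body).fetch("hypotheses", []).flat_map do |hypothesis|
    hypothesis.fetch("linearizations", []).map do |lin|
      [lin["lang"], lin["output"]]
    end
  end
end

linearizations(gf_body).each do |lang, output|
  puts "#{lang}: #{output}"
end
```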