Web API

I am currently running the Irish standardizer and the gd2ga and gv2ga translators as a web service that powers several applications.

To use the API, make an HTTP POST request to the URL https://cadhan.com/api/intergaelic/3.0 with two parameters:

  • teacs: The source text to be translated, as URL-encoded UTF-8
  • foinse: The ISO 639-1 code of the source language ("ga", "gd" or "gv"). Specifying source language "ga" invokes the Irish standardizer. Currently, Irish (ga) is the only supported target language, so it is not specified as a parameter.

The parameters should be sent in the body of the request (not as part of the URL), and the request should specify Content-Type: application/x-www-form-urlencoded. See the various command-line clients for more details on how to construct proper API requests in your favorite language.
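
As a rough illustration of the request format described above, here is a minimal Python sketch using only the standard library. The function names build_request and translate are hypothetical and not part of this repo. The URL, parameter names and Content-Type are taken from the text above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://cadhan.com/api/intergaelic/3.0"


def build_request(text, source_lang):
    """Build (but do not send) a POST request for the API."""
    # urlencode percent-encodes the text as UTF-8 by default.
    body = urllib.parse.urlencode(
        {"teacs": text, "foinse": source_lang}
    ).encode("ascii")
    return urllib.request.Request(
        API_URL,
        data=body,  # parameters go in the request body, not the URL
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def translate(text, source_lang):
    """Send the request and return the decoded JSON array of pairs."""
    with urllib.request.urlopen(build_request(text, source_lang)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling translate("Agus thubhairt e", "gd") would perform the actual network request. build_request can be inspected on its own to check what would be sent.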

The response will be a JSON array of translation pairs. For example, if the value of the foinse parameter is "gd" (Scottish Gaelic), and the value of the teacs parameter is the following string (containing an embedded newline):

Agus thubhairt e,
"Iongantach!" an dèidh sin.

You should get the following response:

[["Agus","Agus"],["thubhairt","dúirt"],["e",""],[",",","],["\\n","\\n"],["\"","\""],["Iongantach","Iontach"],["!","!"],["\"","\""],["an dèidh sin","ina dhiaidh sin"],[".","."]]

How you process the JSON depends on the application you have in mind. If you are only interested in the target language translation, you can extract the second element of each pair and concatenate them (there is a very simple detokenizer included in this repo). But since the languages we support are linguistically very close, in most cases we expect it to be more interesting and useful to use the translations as annotations of one kind or another on the source text, as was done with Intergaelic and the Twitter streams.

Having the full set of translation pairs may also make it easier to carry over any markup from the source text to the target text.
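
The two ways of using the response described above (extracting the target side, or keeping the pairs for annotation) might look roughly like this in Python. This is only a sketch: target_tokens and changed_pairs are hypothetical helper names, and this is not the detokenizer included in this repo. The sample data is the example response shown earlier.

```python
import json

# The example response from above, as raw JSON text.
SAMPLE = r'''[["Agus","Agus"],["thubhairt","dúirt"],["e",""],[",",","],["\\n","\\n"],["\"","\""],["Iongantach","Iontach"],["!","!"],["\"","\""],["an dèidh sin","ina dhiaidh sin"],[".","."]]'''


def target_tokens(pairs):
    """Return the non-empty target-side tokens, in order."""
    return [target for _source, target in pairs if target]


def changed_pairs(pairs):
    """Return the pairs whose target differs from the source.

    These are the natural candidates for annotating the source text.
    """
    return [(source, target) for source, target in pairs if source != target]


pairs = json.loads(SAMPLE)
for source, target in changed_pairs(pairs):
    print(f"{source!r} -> {target!r}")
```

Joining the output of target_tokens with spaces gives a crude translation. Proper spacing around punctuation and handling of the newline token are left to a real detokenizer.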

Details

  • Generally speaking, texts are tokenized into single words, but occasionally a translation pair will have more than one word on the source side, as in the example above (an dèidh sin). Similarly, there may be multiple words on the target side of a translation pair.
  • There is no guarantee that the number of words on the target side of a translation pair will be the same as the number on the source side. It is important to keep this in mind if designing an application that aligns source to target in some way.
  • The translator treats SGML markup, URLs, email addresses and so on as single tokens, and passes them through unchanged.
  • The web service supports CORS requests.
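
To make the point about word counts concrete, the following sketch counts whitespace-separated words on each side of a few pairs taken from the example response. The helper names are hypothetical, and splitting on whitespace is only an approximation of how the service itself tokenizes.

```python
def word_counts(pairs):
    """Return (source_word_count, target_word_count) for each pair."""
    return [(len(source.split()), len(target.split()))
            for source, target in pairs]


def unequal_pairs(pairs):
    """Return the pairs whose two sides have different word counts."""
    return [(source, target) for source, target in pairs
            if len(source.split()) != len(target.split())]


# Pairs taken from the example response above.
pairs = [["thubhairt", "dúirt"], ["e", ""], ["an dèidh sin", "ina dhiaidh sin"]]
print(word_counts(pairs))
print(unequal_pairs(pairs))
```

An application that aligns source to target is therefore safer working pair by pair rather than word by word.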

HTTP Response Codes

  • 200 (OK): Successful request
  • 400 (Bad Request): Missing parameter in request, unsupported source language, empty source text, source text not encoded as UTF-8, etc.
  • 403 (Forbidden): Request from unapproved IP address
  • 405 (Method Not Allowed): Only POST requests permitted
  • 413 (Payload Too Large): Request larger than 16k bytes
  • 500 (Internal Server Error): Translation server failed to process request
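
A client might map these codes to messages along the following lines. This is a sketch only. The table is copied from the list above, and explain_status is a hypothetical helper name.

```python
# Meanings copied from the list of response codes above.
STATUS_MEANINGS = {
    200: "Successful request",
    400: "Bad request: missing parameter, unsupported source language, "
         "empty source text, or text not encoded as UTF-8",
    403: "Forbidden: request from unapproved IP address",
    405: "Method not allowed: only POST requests permitted",
    413: "Payload too large: request larger than 16k bytes",
    500: "Internal server error: translation server failed to process request",
}


def explain_status(code):
    """Return a human-readable explanation for an HTTP status code."""
    return STATUS_MEANINGS.get(code, f"Unexpected status code {code}")


print(explain_status(413))
```

With urllib, non-2xx responses are raised as urllib.error.HTTPError, so the code attribute of the caught exception can be passed to explain_status.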

Rate Limits

Since these are pretty low-traffic web sites, I am not currently placing any rate limits on requests to the API. Individual requests are capped at 16k bytes. In any case, I would appreciate an email (kscanne at gmail) if you build something interesting or useful, especially if you expect to be making many requests.
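
A client can check the size of a request before sending it. The sketch below makes two assumptions that the text above does not settle: that "16k" can be treated conservatively as 16000 bytes, and that the limit applies to the URL-encoded request body. The names encoded_body and fits_limit are hypothetical.

```python
import urllib.parse

# Assumption: treat "16k bytes" conservatively as 16000.
ASSUMED_LIMIT_BYTES = 16000


def encoded_body(text, source_lang):
    """Return the URL-encoded request body as bytes."""
    return urllib.parse.urlencode(
        {"teacs": text, "foinse": source_lang}
    ).encode("ascii")


def fits_limit(text, source_lang, limit=ASSUMED_LIMIT_BYTES):
    """Return True if the encoded body is no larger than the limit."""
    return len(encoded_body(text, source_lang)) <= limit


print(fits_limit("Agus thubhairt e", "gd"))
```

Note that accented characters grow under percent-encoding (for example, "è" becomes the six bytes %C3%A8), so the encoded body can be considerably larger than the source text.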