Merge pull request #881 from xunkar/patch-1
Update ai-service.md
fpscan committed Oct 15, 2023
2 parents 3120d1f + a1914da commit d038542
Showing 1 changed file with 59 additions and 30 deletions: docs/guides/ai-service.md
# RetroArch AI Service

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/wJvbxurnzPg" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## What is the AI Service

This feature allows users to capture the current state of the game and feed it to a customizable endpoint for additional processing. With the help of OCR (optical character recognition) and other techniques, the AI service can provide a live translation of a game or text-to-speech for the visually impaired, among other things, either on demand or automatically.

## How it works

When a user presses the AI Service hotkey, RetroArch will grab the screen of the game being played and send it to the service endpoint listed in the configuration. When the service returns, RetroArch will display the results according to the configuration. Pressing the AI Service hotkey again will clear any content currently displayed.

## How to set it up

First, go to Settings->Input->Hotkey Binds, and assign a key for the AI Service.

Next, go to Settings->AI Service and modify the configuration options as follows.

`AI Service Enabled` should be set to `ON`.

The `AI Service URL` is the URL of the AI service that you want to use. For example, `http://localhost:4404` for a service running locally on your computer and listening on port 4404. Check the documentation of the third-party AI service you're using to find out what this URL should be (see "Known Services" below for some examples).

`AI Service Output` controls how the processed content is displayed on your end. Naturally, your selection should match whatever capabilities the AI service you have configured offers. Note that `Image Mode` requires widgets to be enabled (Settings->On-Screen Display->On-Screen Notifications->Graphics Widgets).

- In `Image Mode` the AI service is expected to return an image that will be overlaid on top of the game feed. This mode can be used to draw information on the screen, like writing a translation over the original text box.
- In `Narrator Mode` the AI service is expected to return text that will be spoken on the user's machine using native text-to-speech capabilities, like the Windows narrator.
- In `Speech Mode` the AI service is expected to return an audio file. This mode can be used as an alternative to the `Narrator Mode` if the user's machine is unable to use native text-to-speech, relying on the service to provide the actual audio.
- In `Text Mode` the AI service is expected to return text that will be displayed on top of the screen, like subtitles.
- A combined mode like `Text + Narrator` works as you would expect, providing the result both as on-screen text and as text-to-speech.

`Pause During Translation` will pause the core as soon as the user presses the AI Service hotkey and display whatever content is returned from the service. Pressing the AI Service hotkey a second time will clear the display and resume the core.

`AI Service Text Position Override` can be used to control the placement of the subtitles on screen when `AI Service Output` is set to `Text Mode`. By default, services are able to decide whether a specific subtitle should be displayed at the top or at the bottom of the screen depending on the situation. This setting, however, ignores the service's hint and forces the placement to one or the other, at the user's discretion. `AI Service Text Padding` allows more precise control of subtitle placement by adding blank space at the bottom of the screen (for bottom-placed subtitles) or at the top of the screen (for top-placed subtitles).

When the service is used to provide translation or text-to-speech using OCR, and `Source Language` is set to `Don't care`, the service will attempt to auto-detect the language on screen. Setting it to a specific language will increase accuracy and restrict translation to text in the source language specified. If `Target Language` is set to `Don't care`, the translation will be provided in English; otherwise it will be provided in the selected language.

## Automatic mode

By default, the AI service runs in manual mode: the user presses the AI Service hotkey to process one screen immediately, receives the result, presses the hotkey again to dismiss it, and plays on until the next request. Some services, however, are able to run automatically. This can only be enabled by the services themselves and is mostly designed for local services, due to the high number of requests per second.

If your service supports automatic mode, press the AI Service hotkey once to enable processing. The service will then be polled at regular intervals, and results will be displayed automatically as you keep playing. Pressing the AI Service hotkey a second time (or invoking RetroArch's menu) will turn the AI service off.

The `AI Service Auto-Polling Delay` option in the AI Service settings controls how frequently automatic requests are sent to the service. If you are experiencing slowdowns during play, try increasing the delay: it lowers the reactivity of the AI service but lessens the load on your CPU.

## Known Services

[VGTranslate](https://gitlab.com/spherebeaker/vgtranslate) is a Python server you can run locally (or on your network); it uses the Google Cloud OCR and Google Text-to-Speech APIs with the Google Cloud keys you provide.

[ZTranslate](https://ztranslate.net) uses a standalone client app for Windows or Linux that grabs the screen of the window currently in focus and displays a translated version in the ZTranslate client window. It also supports package-based translations for curated content.

[RetroArch-AI-with-IoTEdge](https://github.com/toolboc/RetroArch-AI-with-IoTEdge) uses IoTEdge and Azure Cognitive Services Containers (requires a Microsoft Azure account) to translate RetroArch screenshots and display them on a Lakka device.

## Supported Cores

Since RetroArch v1.8.0, all cores should be supported if your build has menu widgets. If not, then only software-rendered cores will be supported.

## For developers

If you wish to implement your own AI service, here is what you need to know. RetroArch sends a POST request to the configured URL every time the user invokes the AI service, with the following URL parameters:

- `source_lang` (optional): language code of the content currently running.
- `target_lang` (optional): language of the content to return.
- `output`: comma-separated list of formats that must be provided by the service. It also lists the sub-formats supported by the current RetroArch build.

The currently supported formats are:
- `sound`: raw audio to play back (`wav`).
- `text`: text to be read through the frontend's text-to-speech capabilities. `subs` can be specified on top of that to indicate that a short, subtitle-like text response is expected.
- `image`: image to display on top of the video feed (`bmp`, `png`, `png-a`), all in 24-bit BGR format.
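
Putting the parameters together, a request for subtitle-style English text translated from Japanese content could look something like the following; the host and port are only placeholders for wherever your service is listening, and the exact query string may vary between RetroArch versions:

```
http://localhost:4404/?source_lang=ja&target_lang=en&output=text,subs
```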

In addition, the request contains a JSON payload, formatted as such:
- `image`: captured frame from the currently running content (in base64).
- `format`: format of the captured frame (`png` or `bmp`).
- `coords`: array describing the coordinates of the image within the viewport space (x, y, width, height).
- `viewport`: array describing the size of the viewport (width, height).
- `label`: a text string describing the content (`<system id>__<content id>`).
- `state`: a JSON object describing the state of the frontend, containing:
    - `paused`: 1 if the content has been paused, 0 otherwise.
    - `<key>`: the name of a retropad input, set to 1 if pressed. Possible keys are: a, b, x, y, l, r, l2, r2, l3, r3, up, down, left, right, start, select.
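
To experiment with a service without launching RetroArch, the request can be simulated with a short script. The sketch below is only illustrative: the screenshot file, URL, language codes, and output list are placeholders, it assumes the third-party `requests` package is installed, and it sends the payload as a JSON body as described above.

```python
# Hypothetical test client that imitates the request RetroArch sends to an AI
# service. Field names follow the description above; all concrete values are
# placeholders.
import base64
import json

import requests  # third-party: pip install requests

# Encode a local screenshot the way RetroArch encodes the captured frame.
with open("screenshot.png", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("ascii")

params = {
    "source_lang": "ja",        # optional: language of the running content
    "target_lang": "en",        # optional: language to return
    "output": "text,subs",      # formats the service must provide
}

payload = {
    "image": frame_b64,                  # captured frame, base64-encoded
    "format": "png",                     # format of the captured frame
    "coords": [0, 0, 320, 240],          # image position/size in the viewport
    "viewport": [320, 240],              # viewport size
    "label": "nes__example_game",        # "<system id>__<content id>"
    "state": {"paused": 1},              # frontend state; retropad keys omitted
}

reply = requests.post("http://localhost:4404", params=params, json=payload)
print(json.dumps(reply.json(), indent=2))
```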

The translation component then expects a response from the AI service in the form of a JSON payload, formatted as such:
- `image`: base64 representation of an image in a supported format.
- `sound`: base64 representation of a sound byte in a supported format.
- `text`: results from the service as a string.
- `text_position`: hint for the position of the text when the service is running in text mode (i.e. subtitles). The position is a number: 1 for bottom or 2 for top (defaults to bottom).
- `press`: a list of retropad inputs to forcibly press. On top of the expected keys (cf. `state` above), the values `pause` and `unpause` can be specified to control the flow of the content.
- `error`: any error encountered with the request.
- `auto`: either `auto` or `continue` to control automatic requests.

All fields are optional, but at least one of them must be present. If `error` is set, the error is shown to the user and everything else is ignored, even `auto` settings.

With `auto` set to `auto`, RetroArch will automatically send a new request (with a minimum delay enforced by `ai_service_poll_delay`); with a value of `continue`, RetroArch will ignore the returned content and skip to the next automatic request. This allows the service to indicate that the returned content is the same as the one previously sent, so RetroArch does not need to update its display unless necessary. With `continue`, the service *must* still send the content, as RetroArch may need to display it if, for instance, the user pauses the AI service.
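
As a starting point, here is a minimal sketch of such a service using only Python's standard library. It assumes RetroArch has `AI Service URL` set to this machine on port 4404 and `AI Service Output` set to `Text Mode`; it ignores the incoming frame and always returns a fixed subtitle, which is where a real service would plug in OCR, translation, or text-to-speech.

```python
# Bare-bones AI service sketch: accepts the POST request described above and
# replies with a JSON payload containing a fixed subtitle.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AIServiceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON payload sent by RetroArch.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Query parameters (source_lang, target_lang, output) arrive in self.path.

        # A real service would inspect payload["image"] here and run OCR,
        # translation, or text-to-speech on it.
        response = {
            "text": "Hello from the AI service",
            "text_position": 1,  # hint: display at the bottom of the screen
        }

        body = json.dumps(response).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 4404), AIServiceHandler).serve_forever()
```

Run the script, point `AI Service URL` at `http://localhost:4404`, and press the AI Service hotkey: the subtitle above should appear on screen if everything is wired up.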
