Using Direct Line Speech

For Cognitive Services Speech Services, please refer to SPEECH.md.

This guide is for using Web Chat with chat and speech functionality provided by the Direct Line Speech protocol.

We assume you have already set up a Direct Line Speech bot and have Web Chat running on your webpage.

Sample code in this article is optimized for modern browsers. You may need to use a transpiler (e.g. Babel) to target a broader range of browsers.

What is Direct Line Speech?

Direct Line Speech is designed for voice assistant scenarios, such as smart displays, automotive dashboards, and navigation systems with low-latency requirements, typically built as single-page applications or progressive web apps (PWA). These apps usually have a highly customized UI and do not show a conversation transcript.

You can look at our samples 06.recomposing-ui/b.speech-ui and 06.recomposing-ui/c.smart-display for these target scenarios.

Direct Line Speech is not recommended for traditional websites where the primary UI is transcript-based.

Support matrix

|  | Chrome/Microsoft Edge and Firefox on desktop | Chrome on Android | Safari on macOS/iOS | Web View on Android | Web View on iOS |
| - | - | - | - | - | - |
| STT Basic recognition | 4.7 | 4.7 | 4.7 | 4.7 | *1 |
| STT Custom Speech (Details) |  |  |  |  | *1 |
| STT Interims/Partial Recognition | 4.7 | 4.7 | 4.7 | 4.7 | *1 |
| STT Select language at initialization | 4.7 | 4.7 | 4.7 | 4.7 | *1 |
| STT Input hint | 4.7 | 4.7 | 4.7 | 4.7 | *1 |
| STT Select input device | *3 | *3 | *3 | *3 | *1 |
| STT Dynamic priming (Details) |  |  |  |  | *1 |
| STT Reference grammar ID (Details) |  |  |  |  | *1 |
| STT Select language on-the-fly (Details) |  |  |  |  | *1 |
| STT Text normalization options (Details) |  |  |  |  | *1 |
| STT Abort recognition (Details) |  |  |  |  | *1 |
| TTS Basic synthesis using text | 4.7 | 4.7 | 4.7 | 4.7 | *2 |
| TTS Speech Synthesis Markup Language | 4.7 | 4.7 | 4.7 | 4.7 | *2 |
| TTS Custom Voice (Details) |  |  |  |  | *2 |
| TTS Selecting voice/pitch/rate/volume | 4.7 | 4.7 | 4.7 | 4.7 | *2 |
| TTS Override using "speak" property | 4.7 | 4.7 | 4.7 | 4.7 | *2 |
| TTS Interrupt synthesis when clicking on microphone button | 4.7 | 4.7 | 4.7 | 4.7 | *2 |
| TTS Text-to-speech audio format (Details) |  |  |  |  | *2 |
| TTS Stripping text from Markdown (Details) |  |  |  |  | *2 |
| TTS Adaptive Cards using "speak" property (Details) |  |  |  |  | *2 |
| TTS Synthesize activity with multiple attachments (Details) |  |  |  |  | *2 |

Notes

  1. Web View on iOS is not a full browser. It does not have audio recording capabilities, which are required for Cognitive Services Speech Services.
  2. As speech recognition is not working (see note 1), speech synthesis is not tested.
  3. Cognitive Services currently has a bug with selecting a different device for audio recording.

Requirements

Direct Line Speech does not support Internet Explorer 11. It requires modern browser media capabilities that are not available in IE11.

Direct Line Speech shares the same requirements as Cognitive Services Speech Services. Please refer to SPEECH.md.

How to get started

Before you start, please create the corresponding Azure resources. You can follow this tutorial to enable voice in your bot. You do not need to follow the steps for creating the C# client; you will replace that client with Web Chat.

Please look at our sample 03.speech/a.direct-line-speech to learn how to embed Web Chat in your web app via the Direct Line Speech channel.

You will need to use Web Chat 4.7 or higher for Direct Line Speech.

After setting up Direct Line Speech on Azure Bot Services, there are two steps for using Direct Line Speech:

Retrieve your Direct Line Speech credentials

You should always use an authorization token when authorizing with Direct Line Speech.

To secure the conversation, you will need to set up a REST API to generate the credentials. When called, it will return an authorization token and a region for your Direct Line Speech channel.

In the following code snippets, we assume that sending an HTTP POST request to https://webchat-mockbot-streaming.azurewebsites.net/speechservices/token will return a JSON object containing authorizationToken and region.

const fetchCredentials = async () => {
  const res = await fetch('https://webchat-mockbot-streaming.azurewebsites.net/speechservices/token', {
    method: 'POST'
  });

  if (!res.ok) {
    throw new Error('Failed to fetch authorization token and region.');
  }

  const { authorizationToken, region } = await res.json();

  return { authorizationToken, region };
};
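
For reference, below is a minimal sketch of such a REST API using Express and Node.js 18+ (for the global fetch), assuming the authorization token is a standard Cognitive Services Speech Services token issued by the issueToken endpoint of the Speech resource linked to your Direct Line Speech channel. The route path and environment variable names are assumptions; substitute your own configuration.

const express = require('express');

const app = express();

// SPEECH_SERVICES_REGION and SPEECH_SERVICES_SUBSCRIPTION_KEY are hypothetical
// environment variable names; substitute your own configuration.
const region = process.env.SPEECH_SERVICES_REGION;
const subscriptionKey = process.env.SPEECH_SERVICES_SUBSCRIPTION_KEY;

app.post('/speechservices/token', async (_, res) => {
  // Exchange the subscription key for a short-lived authorization token.
  const tokenRes = await fetch(`https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`, {
    method: 'POST',
    headers: { 'Ocp-Apim-Subscription-Key': subscriptionKey }
  });

  if (!tokenRes.ok) {
    return res.status(500).send('Failed to issue authorization token.');
  }

  res.json({ authorizationToken: await tokenRes.text(), region });
});

app.listen(process.env.PORT || 3000);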

Since the token expires after 10 minutes, it is advised to cache it for 5 minutes. You can either use the Cache-Control HTTP header on the REST API, or implement a memoization function in the browser, as sketched below.
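
For example, here is a minimal sketch of memoizing fetchCredentials in the browser for 5 minutes; cachedFetchCredentials is a hypothetical helper name, not a Web Chat API.

// A minimal sketch: reuse the cached credentials for 5 minutes before fetching again.
let cachedPromise;
let lastFetchTime = 0;

const cachedFetchCredentials = () => {
  const now = Date.now();

  // 300000 ms = 5 minutes, half of the 10-minute token lifetime.
  if (!cachedPromise || now - lastFetchTime > 300000) {
    cachedPromise = fetchCredentials();
    lastFetchTime = now;
  }

  return cachedPromise;
};

You would then pass cachedFetchCredentials instead of fetchCredentials to createDirectLineSpeechAdapters.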

Render Web Chat using Direct Line Speech adapters

After you have the fetchCredentials function set up, you can pass it to the createDirectLineSpeechAdapters function. This function returns a set of adapters used by Web Chat, including the DirectLineJS adapter and the Web Speech adapter.

const adapters = await window.WebChat.createDirectLineSpeechAdapters({
  fetchCredentials
});

window.WebChat.renderWebChat(
  {
    ...adapters
  },
  document.getElementById('webchat')
);

The code above requires transpilation for browsers that do not support the spread operator.

Supported options

These are the options to pass when calling createDirectLineSpeechAdapters.

| Name | Type | Default | Description |
| - | - | - | - |
| audioConfig | AudioConfig | fromDefaultMicrophoneInput() | Audio input object to use in the Speech SDK. |
| audioContext | AudioContext | window.AudioContext \|\| window.webkitAudioContext | AudioContext used for constructing the audio graph used for speech synthesis. Can be used to prime the Web Audio engine or as a ponyfill. |
| audioInputDeviceId | string | undefined | Device ID of the audio input device. Ignored if audioConfig is specified. |
| fetchCredentials | DirectLineSpeechCredentials | (Required) | An asynchronous function to fetch credentials, including either hostname or region, and either authorization token or subscription key. |
| speechRecognitionLanguage | string | window?.navigator?.language \|\| 'en-US' | Language used for speech recognition. |
| userID | string | (A random ID) | User ID for all outgoing activities. |
| username | string | undefined | Username for all outgoing activities. |
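
For example, below is a minimal sketch of passing some of these options; the device ID and language shown are placeholder values.

// A minimal sketch, assuming fetchCredentials is defined as above.
const adapters = await window.WebChat.createDirectLineSpeechAdapters({
  audioInputDeviceId: 'default', // placeholder: use the default audio input device
  fetchCredentials,
  speechRecognitionLanguage: 'ja-JP' // placeholder: recognize Japanese speech
});

window.WebChat.renderWebChat({ ...adapters }, document.getElementById('webchat'));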

DirectLineSpeechCredentials

type DirectLineSpeechCredentials = {
  authorizationToken: string,
  region: string
} | {
  authorizationToken: string,
  directLineSpeechHostname: string
} | {
  region: string,
  subscriptionKey: string
} | {
  directLineSpeechHostname: string,
  subscriptionKey: string
}

For public clouds, we recommend using the region option, such as "westus2".

For sovereign clouds, you should specify the hostname as a fully qualified domain name (FQDN) through the directLineSpeechHostname option, such as "virginia.convai.speech.azure.us".
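
For example, below is a minimal sketch of a fetchCredentials function for a sovereign cloud; the token endpoint URL is a placeholder for your own REST API.

// A minimal sketch; the token endpoint URL is a placeholder.
const fetchCredentials = async () => {
  const res = await fetch('https://your-token-service.example.com/speechservices/token', { method: 'POST' });

  if (!res.ok) {
    throw new Error('Failed to fetch authorization token.');
  }

  const { authorizationToken } = await res.json();

  // For sovereign clouds, return the FQDN of the endpoint instead of a region.
  return { authorizationToken, directLineSpeechHostname: 'virginia.convai.speech.azure.us' };
};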

Known issues

Differences in conversationUpdate behaviors

Please vote on this bug if this behavior is not desirable.

You can specify the user ID when you instantiate Web Chat (see the sketch after the list below).

  • If you specify a user ID
    • A conversationUpdate activity will be sent on connect and on every reconnect, with your user ID specified in the membersAdded field.
    • All message activities will be sent with your user ID in the from.id field.
  • If you do not specify a user ID
    • A conversationUpdate activity will be sent on connect and on every reconnect. The membersAdded field will have a user ID of empty string.
    • All message activities will be sent with a randomized user ID.
      • The user ID is kept the same across reconnections.
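
For example, here is a minimal sketch of specifying the user ID when creating the adapters; 'dl_12345' is a placeholder value.

// A minimal sketch; 'dl_12345' is a placeholder user ID.
const adapters = await window.WebChat.createDirectLineSpeechAdapters({
  fetchCredentials,
  userID: 'dl_12345'
});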

Connection idle and reconnection

Please vote on this bug if this behavior is not desirable.

After idling for 5 minutes, the WebSocket connection will be disconnected. If the client is still active, we will try to reconnect. On every reconnect, a conversationUpdate activity will be sent.

Text normalization option is not supported

Please vote on this bug if this behavior is not desirable.

Currently, there is no option to specify different text normalization behaviors, including inverse text normalization (ITN), masked ITN, lexical, and display forms.

Page refresh will start a new conversation

Please vote on this bug if this behavior is not desirable.

Web Chat does not persist conversation information (conversation ID and connection ID). Thus, on every page refresh, a new conversation will be created.

Conversation history is not stored or resent

Direct Line Speech does not target a transcript-based experience. Thus, our servers no longer store conversation history. We do not plan to support this feature.

No additional data can be piggybacked on speech recognition

Please vote on this bug if this behavior is not desirable.

When using the text-based experience, developers can piggyback additional information on outgoing messages. This is demonstrated in sample 15.a, "piggyback data on every outgoing activity".

With Direct Line Speech, you can no longer piggyback additional data on speech-based outgoing activities.

Speech recognition language cannot be switched on-the-fly

Please vote on this bug if this behavior is not desirable.

You can only specify the speech recognition language at initialization time. You cannot switch the speech recognition language while the conversation is active.

Proactive messages are not supported

Please vote on this bug if this behavior is not desirable.

Proactive messages are not supported when using Direct Line Speech.

Abort recognition is not supported

Please vote on this bug if this behavior is not desirable.

After the user clicks the microphone button to start speech recognition, they cannot click the microphone button again to abort the recognition. What they have said will continue to be recognized and sent to the bot.

Custom Speech is not supported

Custom Speech is a feature for developers to train a custom speech model to improve speech recognition of uncommon words. It is not configurable through Web Chat; you can set it up using the Speech SDK or in the Azure portal when configuring the Direct Line Speech channel.

Dynamic priming is not supported

Please vote on this bug if this behavior is not desirable.

Dynamic priming (a.k.a. phrase list) is a feature to improve speech recognition of words with similar pronunciations. This is not supported when using Direct Line Speech.

Reference grammar ID is not supported

Please vote on this bug if this behavior is not desirable.

Reference grammar ID is a feature to improve speech recognition accuracy when paired with LUIS. This is not supported when using Direct Line Speech.

Custom Voice is not supported

Please vote on this bug if this behavior is not desirable.

Custom Voice is a feature for developers to perform synthesis using a custom voice font. This is not supported when using Direct Line Speech.

Synthesis audio quality is not configurable

Please vote on this bug if this behavior is not desirable.

When using Direct Line Speech, you cannot specify the audio quality and format for synthesizing speech.

Alternative for Markdown

When the bot sends activities to the user, it can send both plain text and Markdown. If Markdown is sent, the bot should also provide the speak field. The speak field will be used for speech synthesis and is not displayed to the end user.
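
For example, here is a minimal sketch of bot code using the Bot Framework SDK for JavaScript; the Markdown text, URL, and speak text are placeholder values.

// A minimal sketch; the second argument of MessageFactory.text is the "speak" field.
await context.sendActivity(
  MessageFactory.text(
    'Please visit [our website](https://example.com) for more details.',
    'Please visit our website for more details.'
  )
);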

Attachments are not synthesized

Please vote on this bug if this behavior is not desirable.

Attachments are not synthesized. The bot should provide a speak field for speech synthesis.

As attachments are not synthesized, the speak property in Adaptive Cards is ignored. The bot should provide a speak field on the activity for speech synthesis.
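
For example, here is a minimal sketch of sending an Adaptive Card with an activity-level speak field, using the Bot Framework SDK for JavaScript; the card content and speak text are placeholder values.

// A minimal sketch; the card body and speak text are placeholder values.
const card = CardFactory.adaptiveCard({
  type: 'AdaptiveCard',
  version: '1.2',
  body: [{ type: 'TextBlock', text: 'Your order has shipped.' }]
});

// The third argument of MessageFactory.attachment is the "speak" field,
// which is used for synthesis because the attachment itself is not synthesized.
await context.sendActivity(MessageFactory.attachment(card, 'Your order has shipped.', 'Your order has shipped.'));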

Selecting voice

Please submit a feature request if this behavior is not desirable.

Voice can only be selected using Speech Synthesis Markup Language (SSML). For example, the following bot code will use a Japanese voice "NanamiNeural" for synthesis.

await context.sendActivity(
  MessageFactory.text(
    `Echo: ${context.activity.text}`,
    `
    <speak
      version="1.0"
      xmlns="https://www.w3.org/2001/10/synthesis"
      xmlns:mstts="https://www.w3.org/2001/mstts"
      xml:lang="en-US"
    >
      <voice name="ja-JP-NanamiNeural">素晴らしい!</voice>
    </speak>
    `
  )
);

Please refer to this article on SSML support.