Continuous recognition incorrect word level timestamps #394

Closed
KarolScibior opened this issue Jul 6, 2021 · 15 comments · Fixed by #396 or #467
Labels
pending close (Ready for closure pending follow-up or prolonged inactivity)

Comments

@KarolScibior

Hello, I've been using the microsoft-cognitiveservices-speech-sdk library for Node.js for speech recognition and I've stumbled upon a strange bug. I'm starting a SpeechRecognizer with the startContinuousRecognitionAsync method and everything seems fine, and I'm getting correct transcriptions, until the audio file I'm trying to recognize is longer than half an hour. With a 30-minute audio file (a podcast: just talking, no music), at about the 19th minute the word timestamps are reset, although the offset and duration for the whole block are still correct. The previous block has the same offset for the whole block as for the first word in that block, but here (you can see the response below) the block offset (11298000000) is different from the first word's offset (8500000).

{ "privResultId": "7E5DCD04582F42678F1448824DEE8132", "privReason": 3, "privText": "I do tego się odniosą. Tu wspomniałem o tym, że jeżeli mógłby się pojawić właśnie motyw że NI jest w ciąży. No bo w komiksach N Brok ona wie wzięli ślub i eddie Brock mieli dzieciaka. Tam był cały motyw. Niedawno go Edi odzyskał i relacja między dylanem rokiem, czyli synem Diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest, jakie jest pisane? Nie więc może ktoś by zerknął tam, a niekoniecznie na jedzenie ludzi mózgów z lat 90? Ja myślę, że.", "privDuration": 293500000, "privOffset": 11298000000, "privJson": "{\"Id\":\"5ba63698ed854e5aaffb42ae1f6b456a\",\"RecognitionStatus\":\"Success\",\"Offset\":8500000,\"Duration\":293500000,\"DisplayText\":\"I do tego się odniosą. Tu wspomniałem o tym, że jeżeli mógłby się pojawić właśnie motyw że NI jest w ciąży. No bo w komiksach N Brok ona wie wzięli ślub i eddie Brock mieli dzieciaka. Tam był cały motyw. Niedawno go Edi odzyskał i relacja między dylanem rokiem, czyli synem Diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest, jakie jest pisane? Nie więc może ktoś by zerknął tam, a niekoniecznie na jedzenie ludzi mózgów z lat 90? Ja myślę, że.\",\"NBest\":[{\"Confidence\":0.936767,\"Lexical\":\"i do tego się odniosą tu wspomniałem o tym że jeżeli mógłby się pojawić właśnie motyw że n i jest w ciąży no bo w komiksach n brok ona wie wzięli ślub i eddie brock mieli dzieciaka tam był cały motyw niedawno go edi odzyskał i relacja między dylanem rokiem czyli synem diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest jakie jest pisane nie więc może ktoś by zerknął tam a niekoniecznie na jedzenie ludzi mózgów z lat dziewięćdziesiątych ja myślę że\",\"ITN\":\"i do tego się odniosą tu wspomniałem o tym że jeżeli mógłby się pojawić właśnie motyw że NI jest w ciąży no bo w komiksach n brok ona wie wzięli ślub i eddie brock mieli dzieciaka tam był cały motyw niedawno go edi odzyskał i relacja między dylanem rokiem czyli synem diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest jakie jest pisane nie więc może ktoś by zerknął tam a niekoniecznie na jedzenie ludzi mózgów z lat 90 ja myślę że\",\"MaskedITN\":\"i do tego się odniosą tu wspomniałem o tym że jeżeli mógłby się pojawić właśnie motyw że ni jest w ciąży no bo w komiksach n brok ona wie wzięli ślub i eddie brock mieli dzieciaka tam był cały motyw niedawno go edi odzyskał i relacja między dylanem rokiem czyli synem diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest jakie jest pisane nie więc może ktoś by zerknął tam a niekoniecznie na jedzenie ludzi mózgów z lat 90 ja myślę że\",\"Display\":\"I do tego się odniosą. Tu wspomniałem o tym, że jeżeli mógłby się pojawić właśnie motyw że NI jest w ciąży. No bo w komiksach N Brok ona wie wzięli ślub i eddie Brock mieli dzieciaka. Tam był cały motyw. Niedawno go Edi odzyskał i relacja między dylanem rokiem, czyli synem Diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest, jakie jest pisane? Nie więc może ktoś by zerknął tam, a niekoniecznie na jedzenie ludzi mózgów z lat 90? 
Ja myślę, że.\",\"Words\":[{\"Word\":\"i\",\"Offset\":8500000,\"Duration\":1700000},{\"Word\":\"do\",\"Offset\":10300000,\"Duration\":1300000},{\"Word\":\"tego\",\"Offset\":11700000,\"Duration\":2900000},{\"Word\":\"się\",\"Offset\":14700000,\"Duration\":1700000},{\"Word\":\"odniosą\",\"Offset\":16500000,\"Duration\":2700000},{\"Word\":\"tu\",\"Offset\":19300000,\"Duration\":900000},{\"Word\":\"wspomniałem\",\"Offset\":20300000,\"Duration\":3900000},{\"Word\":\"o\",\"Offset\":24300000,\"Duration\":500000},{\"Word\":\"tym\",\"Offset\":24900000,\"Duration\":1500000},{\"Word\":\"że\",\"Offset\":26500000,\"Duration\":1500000},{\"Word\":\"jeżeli\",\"Offset\":28100000,\"Duration\":2300000},{\"Word\":\"mógłby\",\"Offset\":30500000,\"Duration\":2500000},{\"Word\":\"się\",\"Offset\":33100000,\"Duration\":900000},{\"Word\":\"pojawić\",\"Offset\":34100000,\"Duration\":4000000},{\"Word\":\"właśnie\",\"Offset\":38200000,\"Duration\":2100000},{\"Word\":\"motyw\",\"Offset\":40400000,\"Duration\":3200000},{\"Word\":\"że\",\"Offset\":43700000,\"Duration\":3900000},{\"Word\":\"n\",\"Offset\":50400000,\"Duration\":5500000},{\"Word\":\"i\",\"Offset\":56000000,\"Duration\":2000000},{\"Word\":\"jest\",\"Offset\":58100000,\"Duration\":1500000},{\"Word\":\"w\",\"Offset\":59700000,\"Duration\":500000},{\"Word\":\"ciąży\",\"Offset\":60300000,\"Duration\":2900000},{\"Word\":\"no\",\"Offset\":63300000,\"Duration\":900000},{\"Word\":\"bo\",\"Offset\":64300000,\"Duration\":1100000},{\"Word\":\"w\",\"Offset\":65500000,\"Duration\":500000},{\"Word\":\"komiksach\",\"Offset\":66100000,\"Duration\":6700000},{\"Word\":\"n\",\"Offset\":74700000,\"Duration\":3500000},{\"Word\":\"brok\",\"Offset\":78300000,\"Duration\":3700000},{\"Word\":\"ona\",\"Offset\":82100000,\"Duration\":3300000},{\"Word\":\"wie\",\"Offset\":88500000,\"Duration\":3200000},{\"Word\":\"wzięli\",\"Offset\":91800000,\"Duration\":2500000},{\"Word\":\"ślub\",\"Offset\":94400000,\"Duration\":3500000},{\"Word\":\"i\",\"Offset\":104200000,\"Duration\":6000000},{\"Word\":\"eddie\",\"Offset\":110500000,\"Duration\":2100000},{\"Word\":\"brock\",\"Offset\":112700000,\"Duration\":4300000},{\"Word\":\"mieli\",\"Offset\":118100000,\"Duration\":3100000},{\"Word\":\"dzieciaka\",\"Offset\":121300000,\"Duration\":5900000},{\"Word\":\"tam\",\"Offset\":127300000,\"Duration\":1500000},{\"Word\":\"był\",\"Offset\":128900000,\"Duration\":1100000},{\"Word\":\"cały\",\"Offset\":130100000,\"Duration\":2100000},{\"Word\":\"motyw\",\"Offset\":132300000,\"Duration\":3900000},{\"Word\":\"niedawno\",\"Offset\":136300000,\"Duration\":3900000},{\"Word\":\"go\",\"Offset\":140300000,\"Duration\":1500000},{\"Word\":\"edi\",\"Offset\":141900000,\"Duration\":2500000},{\"Word\":\"odzyskał\",\"Offset\":144500000,\"Duration\":4100000},{\"Word\":\"i\",\"Offset\":148700000,\"Duration\":700000},{\"Word\":\"relacja\",\"Offset\":149500000,\"Duration\":3700000},{\"Word\":\"między\",\"Offset\":153300000,\"Duration\":2100000},{\"Word\":\"dylanem\",\"Offset\":155500000,\"Duration\":4100000},{\"Word\":\"rokiem\",\"Offset\":159700000,\"Duration\":3300000},{\"Word\":\"czyli\",\"Offset\":163100000,\"Duration\":1900000},{\"Word\":\"synem\",\"Offset\":165100000,\"Duration\":4500000},{\"Word\":\"diego\",\"Offset\":169900000,\"Duration\":6500000},{\"Word\":\"a\",\"Offset\":177800000,\"Duration\":1600000},{\"Word\":\"samym\",\"Offset\":179500000,\"Duration\":3100000},{\"Word\":\"edim\",\"Offset\":182700000,\"Duration\":4100000},{\"Word\":\"to\",\"Offset\":187100000,\"Duration\":1400000},{\"Word\":\"jest\",\"Offset\
":188600000,\"Duration\":1200000},{\"Word\":\"jedno\",\"Offset\":189900000,\"Duration\":1700000},{\"Word\":\"z\",\"Offset\":191700000,\"Duration\":300000},{\"Word\":\"największych\",\"Offset\":192100000,\"Duration\":4500000},{\"Word\":\"i\",\"Offset\":196700000,\"Duration\":200000},{\"Word\":\"to\",\"Offset\":197000000,\"Duration\":600000},{\"Word\":\"jest\",\"Offset\":197700000,\"Duration\":1200000},{\"Word\":\"największe\",\"Offset\":199000000,\"Duration\":3200000},{\"Word\":\"zło\",\"Offset\":202300000,\"Duration\":2100000},{\"Word\":\"to\",\"Offset\":204500000,\"Duration\":900000},{\"Word\":\"w\",\"Offset\":205500000,\"Duration\":500000},{\"Word\":\"historii\",\"Offset\":206100000,\"Duration\":3600000},{\"Word\":\"będą\",\"Offset\":209800000,\"Duration\":3200000},{\"Word\":\"jakie\",\"Offset\":213100000,\"Duration\":2400000},{\"Word\":\"jest\",\"Offset\":215600000,\"Duration\":2600000},{\"Word\":\"jakie\",\"Offset\":218500000,\"Duration\":2500000},{\"Word\":\"jest\",\"Offset\":221100000,\"Duration\":1500000},{\"Word\":\"pisane\",\"Offset\":222700000,\"Duration\":4300000},{\"Word\":\"nie\",\"Offset\":227100000,\"Duration\":2500000},{\"Word\":\"więc\",\"Offset\":234200000,\"Duration\":3200000},{\"Word\":\"może\",\"Offset\":238900000,\"Duration\":6900000},{\"Word\":\"ktoś\",\"Offset\":245900000,\"Duration\":3100000},{\"Word\":\"by\",\"Offset\":249100000,\"Duration\":1000000},{\"Word\":\"zerknął\",\"Offset\":250200000,\"Duration\":4400000},{\"Word\":\"tam\",\"Offset\":254700000,\"Duration\":2500000},{\"Word\":\"a\",\"Offset\":257300000,\"Duration\":500000},{\"Word\":\"niekoniecznie\",\"Offset\":257900000,\"Duration\":5200000},{\"Word\":\"na\",\"Offset\":263200000,\"Duration\":1000000},{\"Word\":\"jedzenie\",\"Offset\":264300000,\"Duration\":4100000},{\"Word\":\"ludzi\",\"Offset\":268500000,\"Duration\":3900000},{\"Word\":\"mózgów\",\"Offset\":272500000,\"Duration\":4700000},{\"Word\":\"z\",\"Offset\":277300000,\"Duration\":400000},{\"Word\":\"lat\",\"Offset\":277800000,\"Duration\":1500000},{\"Word\":\"dziewięćdziesiątych\",\"Offset\":279400000,\"Duration\":10200000},{\"Word\":\"ja\",\"Offset\":292500000,\"Duration\":2700000},{\"Word\":\"myślę\",\"Offset\":295300000,\"Duration\":2900000},

I think there is definitely something wrong with the Azure response, but just to be safe, the code below shows how I'm using SpeechRecognizer:

```js
const fs = require('fs')
const sdk = require('microsoft-cognitiveservices-speech-sdk')
const { ResultReason, CancellationReason } = sdk

// speechConfig, file (the WAV input), and lang come from the surrounding code,
// e.g. speechConfig via sdk.SpeechConfig.fromSubscription(key, region)
speechConfig.requestWordLevelTimestamps()
speechConfig.enableDictation()
speechConfig.speechRecognitionLanguage = lang || 'pl-PL'
speechConfig.outputFormat = 1

const audioConfig = sdk.AudioConfig.fromWavFileInput(file)

const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig)

const text = await generateText(recognizer) // called inside an async function

//////////////////////////////////////////////////////////////////////////////////////////

// Declared as a function so it is hoisted above the call site
function generateText(recognizer) {
  return new Promise((resolve, reject) => {
    const results = []

    recognizer.startContinuousRecognitionAsync()

    const file = fs.createWriteStream('./public/subtitles/azureResponse.txt')
    file.on('error', err => console.log(err))
    file.on('finish', () => console.log('Finished writing file'))

    recognizer.recognizing = (s, e) => {
      console.log(`RECOGNIZING: Text=${e.result.text}`)
    }

    recognizer.recognized = (s, e) => {
      if (e.result.reason === ResultReason.RecognizedSpeech) {
        console.log(`RECOGNIZED: Text=${e.result.text}`)
        const subs = generateSubtitles(e.result) // user-defined helper (not shown)
        results.push(...subs)
        file.write(JSON.stringify(e.result))
      } else if (e.result.reason === ResultReason.NoMatch) {
        console.log('NOMATCH: Speech could not be recognized.')
      }
    }

    recognizer.canceled = (s, e) => {
      console.log(`CANCELED: Reason=${e.reason}`)
      if (e.reason === CancellationReason.Error) {
        console.log(`CANCELED: ErrorCode=${e.errorCode}`)
        console.log(`CANCELED: ErrorDetails=${e.errorDetails}`)
        console.log('CANCELED: Did you update the subscription info?')
        reject()
      }
      recognizer.stopContinuousRecognitionAsync()
    }

    recognizer.sessionStopped = (s, e) => {
      console.log('\n    Session stopped event.')
      recognizer.stopContinuousRecognitionAsync()
      resolve(results)
    }
  })
}
```

My ultimate goal is to transcribe audio files that are up to 10 hours long. Is the tool I'm using the right one for that, or is this recognizer only meant for shorter audio files?

Thanks.

@glharper added the bug (Something isn't working) and in review (Acknowledged and being looked at now) labels Jul 6, 2021
@glharper self-assigned this Jul 6, 2021
@glharper (Member) commented Jul 6, 2021

@KarolScibior Thanks for using Speech SDK, and for writing this issue up with code to reproduce it. The difference between the two offset values comes from how the SDK interacts with the backend service. The public offset for the result (privOffset, in your case 11298000000) is relative to the start of the audio stream. The offset in the JSON and for the first word (8500000) is the offset within the current turn the backend service is processing, but it should be the sum of that and the total offset. I will add an item to fix this in the JS SDK.
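
Until a fixed SDK ships, the described correction can be approximated client-side. A minimal sketch (not the SDK's internal fix; the helper name is hypothetical, and it assumes the detailed JSON is read via the SpeechServiceResponse_JsonResult property visible in the dumps above):

```js
// Hypothetical helper: shift the turn-relative word offsets in the detailed
// JSON by the stream-relative result offset.
function toAbsoluteWordOffsets(result) {
  const detailed = JSON.parse(
    result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
  )
  // result.offset is relative to the start of the audio stream;
  // detailed.Offset is relative to the current service turn.
  const turnBase = result.offset - detailed.Offset
  const best = Array.isArray(detailed.NBest) ? detailed.NBest[0] : detailed.NBest
  if (!best || !Array.isArray(best.Words)) return []
  return best.Words.map(w => ({
    word: w.Word,
    offset: turnBase + w.Offset, // now stream-relative, in 100 ns ticks
    duration: w.Duration,
  }))
}
```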

@KarolScibior (Author)

Thanks for the quick reply and fix; when can we expect a new release with it?

@glharper (Member) commented Jul 8, 2021

@KarolScibior, we expect the next release, 1.18, to be available in two weeks.

@glharper (Member)

@KarolScibior The latest version of the Speech SDK for JavaScript, which includes this fix, is now live, available for Node via npm install (npm page here) and for the browser via a script include (https://cdn.jsdelivr.net/npm/microsoft-cognitiveservices-speech-sdk@latest/distrib/browser/microsoft.cognitiveservices.speech.sdk.bundle-min.js). Thanks again for using Speech SDK!

@KarolScibior (Author)

Hi, long time no see 🤓

I think there is still something buggy about speech recognition; to be precise, the Offset in the last result item is still wrong. I have a transcription from a really long audio file (the full response is a 755k-line JSON) and the last item is as follows:

{ "privResultId": "DA15E9C7CE50497F836466DAE63A2EF8", "privReason": 3, "privText": "", "privDuration": 6500000, "privOffset": 89716600000, "privJson": { "Id": "774a0c66b39b4b56b8904bfd20177408", "RecognitionStatus": 0, "Offset": 4058900000, "Duration": 6500000, "DisplayText": "", "NBest": { "Confidence": 0.9367672, "Lexical": "w", "ITN": "w", "MaskedITN": "w", "Display": "w", "Words": [ { "Word": "w", "Offset": 4064600000, "Duration": 200000 } ] } } }

I shortened it a bit for your convenience (deleted privProperties, and as NBest I'm showing only the entry with the highest confidence). Here is a link to the full JSON and the audio WAV: https://we.tl/t-AnNAEaeh6i. It's too large for pastebin 😜 Everything is fine except this last item, which is kinda weird.

@glharper reopened this Jan 12, 2022
@glharper (Member)

@KarolScibior The issue is this line in my original fix:
```js
if (!!this.privDetailedSpeechPhrase.NBest && !!this.privDetailedSpeechPhrase.NBest[0].Words) {
```
Note that the JSON you posted has an NBest { } item, which means it's not an array. I need to add a case for this.privDetailedSpeechPhrase.NBest.Words. Thanks for letting me know, I should have a fix soon.
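
For illustration, a guard covering both shapes might look like this (a sketch, not the actual patch; the helper name is hypothetical):

```js
// Accept NBest as either an array (the usual shape) or a bare object
// (the shape in the result quoted above), and find an entry with Words.
function firstNBestWithWords(nbest) {
  if (!nbest) return undefined
  const entries = Array.isArray(nbest) ? nbest : [nbest]
  return entries.find(entry => Array.isArray(entry.Words))
}
```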

@KarolScibior (Author)

NBest is always an array, at least I think. My snippet was mapped from this (original response):

{ "privResultId": "DA15E9C7CE50497F836466DAE63A2EF8", "privReason": 3, "privText": "", "privDuration": 6500000, "privOffset": 89716600000, "privJson": { "Id": "774a0c66b39b4b56b8904bfd20177408", "RecognitionStatus": 0, "Offset": 4058900000, "Duration": 6500000, "DisplayText": "", "NBest": [ { "Confidence": 0, "Lexical": "", "ITN": "", "MaskedITN": "", "Display": "" }, { "Confidence": 0.9367672, "Lexical": "w", "ITN": "w", "MaskedITN": "w", "Display": "w", "Words": [ { "Word": "w", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "o", "ITN": "o", "MaskedITN": "o", "Display": "o", "Words": [ { "Word": "o", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "z", "ITN": "z", "MaskedITN": "z", "Display": "z", "Words": [ { "Word": "z", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "y", "ITN": "y", "MaskedITN": "y", "Display": "y", "Words": [ { "Word": "y", "Offset": 4064600000, "Duration": 200000 } ] } ] }, "privProperties": { "privKeys": [ "SpeechServiceResponse_JsonResult" ], "privValues": { "Id": "774a0c66b39b4b56b8904bfd20177408", "RecognitionStatus": "Success", "Offset": 4058900000, "Duration": 6500000, "DisplayText": "", "NBest": [ { "Confidence": 0, "Lexical": "", "ITN": "", "MaskedITN": "", "Display": "" }, { "Confidence": 0.9367672, "Lexical": "w", "ITN": "w", "MaskedITN": "w", "Display": "w", "Words": [ { "Word": "w", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "o", "ITN": "o", "MaskedITN": "o", "Display": "o", "Words": [ { "Word": "o", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "z", "ITN": "z", "MaskedITN": "z", "Display": "z", "Words": [ { "Word": "z", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "y", "ITN": "y", "MaskedITN": "y", "Display": "y", "Words": [ { "Word": "y", "Offset": 4064600000, "Duration": 200000 } ] } ] } } }

Hope it helps.

@KarolScibior (Author)

Hi, I think there is still a problem with Offsets in long transcriptions (the language is Polish). I have an audio file that is 2.5 hours long and the offsets diverge at some point. Here is an example:

```json
{
    "privResultId": "DE2D804CD4C940C6B2CAE68EA5760F49",
    "privReason": 3,
    "privText": "",
    "privDuration": 8000000,
    "privOffset": 14915300000,
    "privJson": {
      "Id": "b9d1d54e10e347668f2db8f6f4b1a7a3",
      "RecognitionStatus": 0,
      "Offset": 10954600000,
      "Duration": 8000000,
      "DisplayText": "",
      "NBest": [
        {
          "Confidence": 0,
          "Lexical": "",
          "ITN": "",
          "MaskedITN": "",
          "Display": ""
        },
        {
          "Confidence": 0.9367673,
          "Lexical": "tak czy nie",
          "ITN": "tak czy nie",
          "MaskedITN": "tak czy nie",
          "Display": "tak czy nie",
          "Words": [
            {
              "Word": "tak",
              "Offset": 10954800000,
              "Duration": 1900000
            },
            {
              "Word": "czy",
              "Offset": 10956800000,
              "Duration": 4900000
            },
            {
              "Word": "nie",
              "Offset": 10961800000,
              "Duration": 500000
            }
          ]
        },
        {
          "Confidence": 0.9367673,
          "Lexical": "tak trzymać",
          "ITN": "tak trzymać",
          "MaskedITN": "tak trzymać",
          "Display": "tak trzymać",
          "Words": [
            {
              "Word": "tak",
              "Offset": 10954800000,
              "Duration": 1900000
            },
            {
              "Word": "trzymać",
              "Offset": 10956800000,
              "Duration": 5500000
            }
          ]
        },
        {
          "Confidence": 0.9367673,
          "Lexical": "czy nasza",
          "ITN": "czy nasza",
          "MaskedITN": "czy nasza",
          "Display": "czy nasza",
          "Words": [
            {
              "Word": "czy",
              "Offset": 10956600000,
              "Duration": 1300000
            },
            {
              "Word": "nasza",
              "Offset": 10958000000,
              "Duration": 4300000
            }
          ]
        },
        {
          "Confidence": 0.9367673,
          "Lexical": "trzynastu",
          "ITN": "trzynastu",
          "MaskedITN": "trzynastu",
          "Display": "trzynastu",
          "Words": [
            {
              "Word": "trzynastu",
              "Offset": 10956600000,
              "Duration": 5700000
            }
          ]
        }
      ]
    },
    "privProperties": {
      "privKeys": [
        "SpeechServiceResponse_JsonResult"
      ],
      "privValues": {
        "Id": "b9d1d54e10e347668f2db8f6f4b1a7a3",
        "RecognitionStatus": "Success",
        "Offset": 10954600000,
        "Duration": 8000000,
        "DisplayText": "",
        "NBest": [
          {
            "Confidence": 0,
            "Lexical": "",
            "ITN": "",
            "MaskedITN": "",
            "Display": ""
          },
          {
            "Confidence": 0.9367673,
            "Lexical": "tak czy nie",
            "ITN": "tak czy nie",
            "MaskedITN": "tak czy nie",
            "Display": "tak czy nie",
            "Words": [
              {
                "Word": "tak",
                "Offset": 10954800000,
                "Duration": 1900000
              },
              {
                "Word": "czy",
                "Offset": 10956800000,
                "Duration": 4900000
              },
              {
                "Word": "nie",
                "Offset": 10961800000,
                "Duration": 500000
              }
            ]
          },
          {
            "Confidence": 0.9367673,
            "Lexical": "tak trzymać",
            "ITN": "tak trzymać",
            "MaskedITN": "tak trzymać",
            "Display": "tak trzymać",
            "Words": [
              {
                "Word": "tak",
                "Offset": 10954800000,
                "Duration": 1900000
              },
              {
                "Word": "trzymać",
                "Offset": 10956800000,
                "Duration": 5500000
              }
            ]
          },
          {
            "Confidence": 0.9367673,
            "Lexical": "czy nasza",
            "ITN": "czy nasza",
            "MaskedITN": "czy nasza",
            "Display": "czy nasza",
            "Words": [
              {
                "Word": "czy",
                "Offset": 10956600000,
                "Duration": 1300000
              },
              {
                "Word": "nasza",
                "Offset": 10958000000,
                "Duration": 4300000
              }
            ]
          },
          {
            "Confidence": 0.9367673,
            "Lexical": "trzynastu",
            "ITN": "trzynastu",
            "MaskedITN": "trzynastu",
            "Display": "trzynastu",
            "Words": [
              {
                "Word": "trzynastu",
                "Offset": 10956600000,
                "Duration": 5700000
              }
            ]
          }
        ]
      }
    }
}
```

As you can see, privOffset is different from the Offset in privJson. Also, the first word's offset in Words in NBest sometimes matches privOffset and sometimes matches the Offset in privJson.

Another question I have is about privText. What is the difference between privText, DisplayText in privJson, and Display in NBest? In the example above privText and DisplayText are empty, as is the first NBest entry, but the next items in the NBest array are not empty.

Also, why is the confidence the same in every NBest item?

This example is not an isolated case; something is definitely off. Here again is a link to the full Azure response log (prettified) and the audio file: https://we.tl/t-mCL5A4nPg4

@KarolScibior (Author)

@glharper bumping last comment

@KarolScibior (Author)

@glharper @dargilco This is a really serious issue, can you look into it please?

@dargilco (Member)

Reopening, clearing owner for triage

@KarolScibior (Author)

Hello, any news regarding this issue? @dargilco

@glharper (Member) commented Apr 1, 2022

@KarolScibior Please re-upload the file used to reproduce this, and I'll take a look.

@KarolScibior (Author)

Link to audio file: https://we.tl/t-KdKweEF6ej

Recognition language is pl-PL

@glharper (Member) commented Apr 4, 2022

@KarolScibior, we have a batch transcription API that is the intended solution for long-form transcription use cases like this. The docs for that service are here.
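
For reference, creating a batch transcription job is a single REST call; a minimal sketch against the v3.0 endpoint (region, key, and the content URL are placeholders, and global fetch assumes Node 18+):

```js
// Sketch: create a batch transcription job (Speech to text REST API v3.0).
// `region`, `key`, and the audio URL below are placeholders.
const response = await fetch(
  `https://${region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions`,
  {
    method: 'POST',
    headers: {
      'Ocp-Apim-Subscription-Key': key,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      displayName: 'long-form transcription',
      locale: 'pl-PL',
      contentUrls: ['https://example.com/audio.wav'], // SAS URL to the audio blob
      properties: { wordLevelTimestampsEnabled: true },
    }),
  }
)
console.log(response.status, await response.json()) // poll the returned job URL until done
```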

Answers to your questions below.

> Hi, I think there is still a problem with Offsets in long transcriptions (the language is Polish). I have an audio file that is 2.5 hours long and the offsets diverge at some point. Here is an example:
> [...]
> As you can see, privOffset is different from the Offset in privJson.

For Recognized results (where result.reason === ResultReason.RecognizedSpeech (3)), this is as intended. The Offset in the privJson should be ignored: it is the offset reported by the service, which doesn't keep track of how long the current turn has been running. (The word-level offsets in the JSON are correct, as that was the original fix for this issue.) Please use privOffset, which is exposed via the result.offset property.
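
In practice, a consumer of final results would read the phrase offset from result.offset and take the word offsets straight from the detailed JSON; a minimal sketch along those lines:

```js
// Sketch of consuming final results per the guidance above: use result.offset
// for the phrase, and the Words[] offsets from the detailed JSON as-is,
// ignoring the JSON's top-level Offset.
recognizer.recognized = (s, e) => {
  if (e.result.reason !== sdk.ResultReason.RecognizedSpeech) return
  console.log(`phrase offset (stream-relative): ${e.result.offset}`)
  const detailed = JSON.parse(
    e.result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
  )
  const best = (detailed.NBest || []).find(n => Array.isArray(n.Words))
  for (const w of best ? best.Words : []) {
    console.log(`${w.Word}: offset=${w.Offset} duration=${w.Duration}`) // 100 ns ticks
  }
}
```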

> Also, the first word's offset in Words in NBest sometimes matches privOffset and sometimes matches the Offset in privJson.

I'm not seeing that above... Is that happening where result.reason === ResultReason.RecognizedSpeech (3)?

> Another question I have is about privText. What is the difference between privText, DisplayText in privJson, and Display in NBest?

For simple recognition results, privText will be the DisplayText in privJson.
For detailed recognition results, privText will be the Display of the first NBest element.
The service does often seem to send back an empty first NBest element with an empty string as the Display value, and I don't know why. That seems wrong.

> In the example above privText and DisplayText are empty, as is the first NBest entry, but the next items in the NBest array are not empty.

I can add a workaround in JS where the first element of NBest is ignored if its Display field is empty, but this seems like incorrect service behavior, and thus a somewhat fragile fix.
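
Until such a workaround ships, callers can apply the same filtering client-side; a small sketch, assuming `detailed` is the parsed result JSON:

```js
// Skip NBest entries whose Display text is empty before picking a hypothesis.
const usable = (detailed.NBest || []).filter(n => n.Display && n.Display.length > 0)
const best = usable[0] // first non-empty hypothesis
```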

> Also, why is the confidence the same in every NBest item?

Again, that's a service issue.

@glharper added the pending close (Ready for closure pending follow-up or prolonged inactivity) label and removed the in review (Acknowledged and being looked at now) and bug (Something isn't working) labels May 10, 2022