Continuous recognition incorrect word level timestamps #394

Closed
KarolScibior opened this issue Jul 6, 2021 · 15 comments · Fixed by #396 or #467
Labels
pending close (Ready for closure pending follow-up or prolonged inactivity)

Comments

@KarolScibior

Hello, I've been using the microsoft-cognitiveservices-speech-sdk library for Node.js for speech recognition and I've stumbled upon a strange bug. I'm starting a SpeechRecognizer with the startContinuousRecognitionAsync method and everything seems fine, and I'm getting correct transcriptions, until the audio file I'm trying to recognize is longer than half an hour. With a 30-minute audio file (a podcast: just talking, no music), at about the 19th minute the word timestamps are reset, although the offset and duration for the whole block are still correct. The previous block has the same offset for the whole block as for the first word in that block, but here (you can see the response below) the block offset (11298000000) is different from the first word's offset (8500000).

{ "privResultId": "7E5DCD04582F42678F1448824DEE8132", "privReason": 3, "privText": "I do tego się odniosą. Tu wspomniałem o tym, że jeżeli mógłby się pojawić właśnie motyw że NI jest w ciąży. No bo w komiksach N Brok ona wie wzięli ślub i eddie Brock mieli dzieciaka. Tam był cały motyw. Niedawno go Edi odzyskał i relacja między dylanem rokiem, czyli synem Diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest, jakie jest pisane? Nie więc może ktoś by zerknął tam, a niekoniecznie na jedzenie ludzi mózgów z lat 90? Ja myślę, że.", "privDuration": 293500000, "privOffset": 11298000000, "privJson": "{\"Id\":\"5ba63698ed854e5aaffb42ae1f6b456a\",\"RecognitionStatus\":\"Success\",\"Offset\":8500000,\"Duration\":293500000,\"DisplayText\":\"I do tego się odniosą. Tu wspomniałem o tym, że jeżeli mógłby się pojawić właśnie motyw że NI jest w ciąży. No bo w komiksach N Brok ona wie wzięli ślub i eddie Brock mieli dzieciaka. Tam był cały motyw. Niedawno go Edi odzyskał i relacja między dylanem rokiem, czyli synem Diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest, jakie jest pisane? Nie więc może ktoś by zerknął tam, a niekoniecznie na jedzenie ludzi mózgów z lat 90? Ja myślę, że.\",\"NBest\":[{\"Confidence\":0.936767,\"Lexical\":\"i do tego się odniosą tu wspomniałem o tym że jeżeli mógłby się pojawić właśnie motyw że n i jest w ciąży no bo w komiksach n brok ona wie wzięli ślub i eddie brock mieli dzieciaka tam był cały motyw niedawno go edi odzyskał i relacja między dylanem rokiem czyli synem diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest jakie jest pisane nie więc może ktoś by zerknął tam a niekoniecznie na jedzenie ludzi mózgów z lat dziewięćdziesiątych ja myślę że\",\"ITN\":\"i do tego się odniosą tu wspomniałem o tym że jeżeli mógłby się pojawić właśnie motyw że NI jest w ciąży no bo w komiksach n brok ona wie wzięli ślub i eddie brock mieli dzieciaka tam był cały motyw niedawno go edi odzyskał i relacja między dylanem rokiem czyli synem diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest jakie jest pisane nie więc może ktoś by zerknął tam a niekoniecznie na jedzenie ludzi mózgów z lat 90 ja myślę że\",\"MaskedITN\":\"i do tego się odniosą tu wspomniałem o tym że jeżeli mógłby się pojawić właśnie motyw że ni jest w ciąży no bo w komiksach n brok ona wie wzięli ślub i eddie brock mieli dzieciaka tam był cały motyw niedawno go edi odzyskał i relacja między dylanem rokiem czyli synem diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest jakie jest pisane nie więc może ktoś by zerknął tam a niekoniecznie na jedzenie ludzi mózgów z lat 90 ja myślę że\",\"Display\":\"I do tego się odniosą. Tu wspomniałem o tym, że jeżeli mógłby się pojawić właśnie motyw że NI jest w ciąży. No bo w komiksach N Brok ona wie wzięli ślub i eddie Brock mieli dzieciaka. Tam był cały motyw. Niedawno go Edi odzyskał i relacja między dylanem rokiem, czyli synem Diego a samym edim to jest jedno z największych i to jest największe zło to w historii będą jakie jest, jakie jest pisane? Nie więc może ktoś by zerknął tam, a niekoniecznie na jedzenie ludzi mózgów z lat 90? 
Ja myślę, że.\",\"Words\":[{\"Word\":\"i\",\"Offset\":8500000,\"Duration\":1700000},{\"Word\":\"do\",\"Offset\":10300000,\"Duration\":1300000},{\"Word\":\"tego\",\"Offset\":11700000,\"Duration\":2900000},{\"Word\":\"się\",\"Offset\":14700000,\"Duration\":1700000},{\"Word\":\"odniosą\",\"Offset\":16500000,\"Duration\":2700000},{\"Word\":\"tu\",\"Offset\":19300000,\"Duration\":900000},{\"Word\":\"wspomniałem\",\"Offset\":20300000,\"Duration\":3900000},{\"Word\":\"o\",\"Offset\":24300000,\"Duration\":500000},{\"Word\":\"tym\",\"Offset\":24900000,\"Duration\":1500000},{\"Word\":\"że\",\"Offset\":26500000,\"Duration\":1500000},{\"Word\":\"jeżeli\",\"Offset\":28100000,\"Duration\":2300000},{\"Word\":\"mógłby\",\"Offset\":30500000,\"Duration\":2500000},{\"Word\":\"się\",\"Offset\":33100000,\"Duration\":900000},{\"Word\":\"pojawić\",\"Offset\":34100000,\"Duration\":4000000},{\"Word\":\"właśnie\",\"Offset\":38200000,\"Duration\":2100000},{\"Word\":\"motyw\",\"Offset\":40400000,\"Duration\":3200000},{\"Word\":\"że\",\"Offset\":43700000,\"Duration\":3900000},{\"Word\":\"n\",\"Offset\":50400000,\"Duration\":5500000},{\"Word\":\"i\",\"Offset\":56000000,\"Duration\":2000000},{\"Word\":\"jest\",\"Offset\":58100000,\"Duration\":1500000},{\"Word\":\"w\",\"Offset\":59700000,\"Duration\":500000},{\"Word\":\"ciąży\",\"Offset\":60300000,\"Duration\":2900000},{\"Word\":\"no\",\"Offset\":63300000,\"Duration\":900000},{\"Word\":\"bo\",\"Offset\":64300000,\"Duration\":1100000},{\"Word\":\"w\",\"Offset\":65500000,\"Duration\":500000},{\"Word\":\"komiksach\",\"Offset\":66100000,\"Duration\":6700000},{\"Word\":\"n\",\"Offset\":74700000,\"Duration\":3500000},{\"Word\":\"brok\",\"Offset\":78300000,\"Duration\":3700000},{\"Word\":\"ona\",\"Offset\":82100000,\"Duration\":3300000},{\"Word\":\"wie\",\"Offset\":88500000,\"Duration\":3200000},{\"Word\":\"wzięli\",\"Offset\":91800000,\"Duration\":2500000},{\"Word\":\"ślub\",\"Offset\":94400000,\"Duration\":3500000},{\"Word\":\"i\",\"Offset\":104200000,\"Duration\":6000000},{\"Word\":\"eddie\",\"Offset\":110500000,\"Duration\":2100000},{\"Word\":\"brock\",\"Offset\":112700000,\"Duration\":4300000},{\"Word\":\"mieli\",\"Offset\":118100000,\"Duration\":3100000},{\"Word\":\"dzieciaka\",\"Offset\":121300000,\"Duration\":5900000},{\"Word\":\"tam\",\"Offset\":127300000,\"Duration\":1500000},{\"Word\":\"był\",\"Offset\":128900000,\"Duration\":1100000},{\"Word\":\"cały\",\"Offset\":130100000,\"Duration\":2100000},{\"Word\":\"motyw\",\"Offset\":132300000,\"Duration\":3900000},{\"Word\":\"niedawno\",\"Offset\":136300000,\"Duration\":3900000},{\"Word\":\"go\",\"Offset\":140300000,\"Duration\":1500000},{\"Word\":\"edi\",\"Offset\":141900000,\"Duration\":2500000},{\"Word\":\"odzyskał\",\"Offset\":144500000,\"Duration\":4100000},{\"Word\":\"i\",\"Offset\":148700000,\"Duration\":700000},{\"Word\":\"relacja\",\"Offset\":149500000,\"Duration\":3700000},{\"Word\":\"między\",\"Offset\":153300000,\"Duration\":2100000},{\"Word\":\"dylanem\",\"Offset\":155500000,\"Duration\":4100000},{\"Word\":\"rokiem\",\"Offset\":159700000,\"Duration\":3300000},{\"Word\":\"czyli\",\"Offset\":163100000,\"Duration\":1900000},{\"Word\":\"synem\",\"Offset\":165100000,\"Duration\":4500000},{\"Word\":\"diego\",\"Offset\":169900000,\"Duration\":6500000},{\"Word\":\"a\",\"Offset\":177800000,\"Duration\":1600000},{\"Word\":\"samym\",\"Offset\":179500000,\"Duration\":3100000},{\"Word\":\"edim\",\"Offset\":182700000,\"Duration\":4100000},{\"Word\":\"to\",\"Offset\":187100000,\"Duration\":1400000},{\"Word\":\"jest\",\"Offset\
":188600000,\"Duration\":1200000},{\"Word\":\"jedno\",\"Offset\":189900000,\"Duration\":1700000},{\"Word\":\"z\",\"Offset\":191700000,\"Duration\":300000},{\"Word\":\"największych\",\"Offset\":192100000,\"Duration\":4500000},{\"Word\":\"i\",\"Offset\":196700000,\"Duration\":200000},{\"Word\":\"to\",\"Offset\":197000000,\"Duration\":600000},{\"Word\":\"jest\",\"Offset\":197700000,\"Duration\":1200000},{\"Word\":\"największe\",\"Offset\":199000000,\"Duration\":3200000},{\"Word\":\"zło\",\"Offset\":202300000,\"Duration\":2100000},{\"Word\":\"to\",\"Offset\":204500000,\"Duration\":900000},{\"Word\":\"w\",\"Offset\":205500000,\"Duration\":500000},{\"Word\":\"historii\",\"Offset\":206100000,\"Duration\":3600000},{\"Word\":\"będą\",\"Offset\":209800000,\"Duration\":3200000},{\"Word\":\"jakie\",\"Offset\":213100000,\"Duration\":2400000},{\"Word\":\"jest\",\"Offset\":215600000,\"Duration\":2600000},{\"Word\":\"jakie\",\"Offset\":218500000,\"Duration\":2500000},{\"Word\":\"jest\",\"Offset\":221100000,\"Duration\":1500000},{\"Word\":\"pisane\",\"Offset\":222700000,\"Duration\":4300000},{\"Word\":\"nie\",\"Offset\":227100000,\"Duration\":2500000},{\"Word\":\"więc\",\"Offset\":234200000,\"Duration\":3200000},{\"Word\":\"może\",\"Offset\":238900000,\"Duration\":6900000},{\"Word\":\"ktoś\",\"Offset\":245900000,\"Duration\":3100000},{\"Word\":\"by\",\"Offset\":249100000,\"Duration\":1000000},{\"Word\":\"zerknął\",\"Offset\":250200000,\"Duration\":4400000},{\"Word\":\"tam\",\"Offset\":254700000,\"Duration\":2500000},{\"Word\":\"a\",\"Offset\":257300000,\"Duration\":500000},{\"Word\":\"niekoniecznie\",\"Offset\":257900000,\"Duration\":5200000},{\"Word\":\"na\",\"Offset\":263200000,\"Duration\":1000000},{\"Word\":\"jedzenie\",\"Offset\":264300000,\"Duration\":4100000},{\"Word\":\"ludzi\",\"Offset\":268500000,\"Duration\":3900000},{\"Word\":\"mózgów\",\"Offset\":272500000,\"Duration\":4700000},{\"Word\":\"z\",\"Offset\":277300000,\"Duration\":400000},{\"Word\":\"lat\",\"Offset\":277800000,\"Duration\":1500000},{\"Word\":\"dziewięćdziesiątych\",\"Offset\":279400000,\"Duration\":10200000},{\"Word\":\"ja\",\"Offset\":292500000,\"Duration\":2700000},{\"Word\":\"myślę\",\"Offset\":295300000,\"Duration\":2900000},

I think there is definitely something wrong with the Azure response, but just to be safe, the code below shows how I'm using SpeechRecognizer:

```js
const fs = require('fs')
const sdk = require('microsoft-cognitiveservices-speech-sdk')
const { ResultReason, CancellationReason } = sdk

// speechConfig, file (the WAV input), and lang come from the surrounding code,
// e.g. speechConfig via sdk.SpeechConfig.fromSubscription(key, region)
speechConfig.requestWordLevelTimestamps()
speechConfig.enableDictation()
speechConfig.speechRecognitionLanguage = lang || 'pl-PL'
speechConfig.outputFormat = 1

const audioConfig = sdk.AudioConfig.fromWavFileInput(file)

const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig)

const text = await generateText(recognizer) // called inside an async function

//////////////////////////////////////////////////////////////////////////////////////////

// Declared as a function so it is hoisted above the call site
function generateText(recognizer) {
  return new Promise((resolve, reject) => {
    const results = []

    recognizer.startContinuousRecognitionAsync()

    const file = fs.createWriteStream('./public/subtitles/azureResponse.txt')
    file.on('error', err => console.log(err))
    file.on('finish', () => console.log('Finished writing file'))

    recognizer.recognizing = (s, e) => {
      console.log(`RECOGNIZING: Text=${e.result.text}`)
    }

    recognizer.recognized = (s, e) => {
      if (e.result.reason === ResultReason.RecognizedSpeech) {
        console.log(`RECOGNIZED: Text=${e.result.text}`)
        const subs = generateSubtitles(e.result) // user-defined helper (not shown)
        results.push(...subs)
        file.write(JSON.stringify(e.result))
      } else if (e.result.reason === ResultReason.NoMatch) {
        console.log('NOMATCH: Speech could not be recognized.')
      }
    }

    recognizer.canceled = (s, e) => {
      console.log(`CANCELED: Reason=${e.reason}`)
      if (e.reason === CancellationReason.Error) {
        console.log(`CANCELED: ErrorCode=${e.errorCode}`)
        console.log(`CANCELED: ErrorDetails=${e.errorDetails}`)
        console.log('CANCELED: Did you update the subscription info?')
        reject()
      }
      recognizer.stopContinuousRecognitionAsync()
    }

    recognizer.sessionStopped = (s, e) => {
      console.log('\n    Session stopped event.')
      recognizer.stopContinuousRecognitionAsync()
      resolve(results)
    }
  })
}
```

My ultimate goal is to transcribe audio files that are up to 10 hours long. Is the tool I'm using the right one for that, or is this recognizer only meant for shorter audio files?

Thanks.

@glharper added the bug (Something isn't working) and in review (Acknowledged and being looked at now) labels Jul 6, 2021
@glharper self-assigned this Jul 6, 2021
@glharper (Member) commented Jul 6, 2021

@KarolScibior Thanks for using Speech SDK, and for writing this issue up with code to reproduce it. The difference between the two offset values comes from how the SDK interacts with the backend service. The public offset for the result (privOffset, in your case 11298000000) is relative to the start of the audio stream. The offset in the JSON and for the first word (8500000) is the offset within the current turn the backend service is processing, but it should be the sum of that and the total offset. I will add an item to fix this in the JS SDK.
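
Until a fixed SDK ships, the described correction can be approximated client-side. A minimal sketch (not the SDK's internal fix; the helper name is hypothetical, and it assumes the detailed JSON is read via the SpeechServiceResponse_JsonResult property visible in the dumps above):

```js
// Hypothetical helper: shift the turn-relative word offsets in the detailed
// JSON by the stream-relative result offset.
function toAbsoluteWordOffsets(result) {
  const detailed = JSON.parse(
    result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
  )
  // result.offset is relative to the start of the audio stream;
  // detailed.Offset is relative to the current service turn.
  const turnBase = result.offset - detailed.Offset
  const best = Array.isArray(detailed.NBest) ? detailed.NBest[0] : detailed.NBest
  if (!best || !Array.isArray(best.Words)) return []
  return best.Words.map(w => ({
    word: w.Word,
    offset: turnBase + w.Offset, // now stream-relative, in 100 ns ticks
    duration: w.Duration,
  }))
}
```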

@KarolScibior (Author)

Thanks for the quick reply and fix; when can we expect a new release with it?

@glharper (Member) commented Jul 8, 2021

@KarolScibior, we expect the next release, 1.18, to be available in two weeks.

@glharper (Member)

@KarolScibior The latest version of the Speech SDK for JavaScript, which includes this fix, is now live, available for Node via npm install (npm page here) and for the browser via a script include (https://cdn.jsdelivr.net/npm/microsoft-cognitiveservices-speech-sdk@latest/distrib/browser/microsoft.cognitiveservices.speech.sdk.bundle-min.js). Thanks again for using Speech SDK!

@KarolScibior (Author)

Hi, long time no see 🤓

I think there is still something buggy about speech recognition; to be precise, the Offset in the last result item is still wrong. I have a transcription from a really long audio file (the full response is a 755k-line JSON) and the last item is as follows:

{ "privResultId": "DA15E9C7CE50497F836466DAE63A2EF8", "privReason": 3, "privText": "", "privDuration": 6500000, "privOffset": 89716600000, "privJson": { "Id": "774a0c66b39b4b56b8904bfd20177408", "RecognitionStatus": 0, "Offset": 4058900000, "Duration": 6500000, "DisplayText": "", "NBest": { "Confidence": 0.9367672, "Lexical": "w", "ITN": "w", "MaskedITN": "w", "Display": "w", "Words": [ { "Word": "w", "Offset": 4064600000, "Duration": 200000 } ] } } }

I shortened it a bit for your convenience (deleted privProperties, and as NBest I'm showing only the entry with the highest confidence). Here is a link to the full JSON and the audio WAV: https://we.tl/t-AnNAEaeh6i. It's too large for pastebin 😜 Everything is fine except this last item, which is kinda weird.

@glharper reopened this Jan 12, 2022
@glharper (Member)

@KarolScibior The issue is this line in my original fix:
```js
if (!!this.privDetailedSpeechPhrase.NBest && !!this.privDetailedSpeechPhrase.NBest[0].Words) {
```
Note that the JSON you posted has an NBest { } item, which means it's not an array. I need to add a case for this.privDetailedSpeechPhrase.NBest.Words. Thanks for letting me know, I should have a fix soon.
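
For illustration, a guard covering both shapes might look like this (a sketch, not the actual patch; the helper name is hypothetical):

```js
// Accept NBest as either an array (the usual shape) or a bare object
// (the shape in the result quoted above), and find an entry with Words.
function firstNBestWithWords(nbest) {
  if (!nbest) return undefined
  const entries = Array.isArray(nbest) ? nbest : [nbest]
  return entries.find(entry => Array.isArray(entry.Words))
}
```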

@KarolScibior (Author)

NBest is always an array, at least I think. My snippet was mapped from this (original response):

{ "privResultId": "DA15E9C7CE50497F836466DAE63A2EF8", "privReason": 3, "privText": "", "privDuration": 6500000, "privOffset": 89716600000, "privJson": { "Id": "774a0c66b39b4b56b8904bfd20177408", "RecognitionStatus": 0, "Offset": 4058900000, "Duration": 6500000, "DisplayText": "", "NBest": [ { "Confidence": 0, "Lexical": "", "ITN": "", "MaskedITN": "", "Display": "" }, { "Confidence": 0.9367672, "Lexical": "w", "ITN": "w", "MaskedITN": "w", "Display": "w", "Words": [ { "Word": "w", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "o", "ITN": "o", "MaskedITN": "o", "Display": "o", "Words": [ { "Word": "o", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "z", "ITN": "z", "MaskedITN": "z", "Display": "z", "Words": [ { "Word": "z", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "y", "ITN": "y", "MaskedITN": "y", "Display": "y", "Words": [ { "Word": "y", "Offset": 4064600000, "Duration": 200000 } ] } ] }, "privProperties": { "privKeys": [ "SpeechServiceResponse_JsonResult" ], "privValues": { "Id": "774a0c66b39b4b56b8904bfd20177408", "RecognitionStatus": "Success", "Offset": 4058900000, "Duration": 6500000, "DisplayText": "", "NBest": [ { "Confidence": 0, "Lexical": "", "ITN": "", "MaskedITN": "", "Display": "" }, { "Confidence": 0.9367672, "Lexical": "w", "ITN": "w", "MaskedITN": "w", "Display": "w", "Words": [ { "Word": "w", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "o", "ITN": "o", "MaskedITN": "o", "Display": "o", "Words": [ { "Word": "o", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "z", "ITN": "z", "MaskedITN": "z", "Display": "z", "Words": [ { "Word": "z", "Offset": 4064600000, "Duration": 200000 } ] }, { "Confidence": 0.9367672, "Lexical": "y", "ITN": "y", "MaskedITN": "y", "Display": "y", "Words": [ { "Word": "y", "Offset": 4064600000, "Duration": 200000 } ] } ] } } }

Hope it helps.

@KarolScibior (Author)

Hi, I think there is still a problem with Offsets in long transcriptions (the language is Polish). I have an audio file that is 2.5 hours long and the offsets diverge at some point. Here is an example:

```json
{
    "privResultId": "DE2D804CD4C940C6B2CAE68EA5760F49",
    "privReason": 3,
    "privText": "",
    "privDuration": 8000000,
    "privOffset": 14915300000,
    "privJson": {
      "Id": "b9d1d54e10e347668f2db8f6f4b1a7a3",
      "RecognitionStatus": 0,
      "Offset": 10954600000,
      "Duration": 8000000,
      "DisplayText": "",
      "NBest": [
        {
          "Confidence": 0,
          "Lexical": "",
          "ITN": "",
          "MaskedITN": "",
          "Display": ""
        },
        {
          "Confidence": 0.9367673,
          "Lexical": "tak czy nie",
          "ITN": "tak czy nie",
          "MaskedITN": "tak czy nie",
          "Display": "tak czy nie",
          "Words": [
            {
              "Word": "tak",
              "Offset": 10954800000,
              "Duration": 1900000
            },
            {
              "Word": "czy",
              "Offset": 10956800000,
              "Duration": 4900000
            },
            {
              "Word": "nie",
              "Offset": 10961800000,
              "Duration": 500000
            }
          ]
        },
        {
          "Confidence": 0.9367673,
          "Lexical": "tak trzymać",
          "ITN": "tak trzymać",
          "MaskedITN": "tak trzymać",
          "Display": "tak trzymać",
          "Words": [
            {
              "Word": "tak",
              "Offset": 10954800000,
              "Duration": 1900000
            },
            {
              "Word": "trzymać",
              "Offset": 10956800000,
              "Duration": 5500000
            }
          ]
        },
        {
          "Confidence": 0.9367673,
          "Lexical": "czy nasza",
          "ITN": "czy nasza",
          "MaskedITN": "czy nasza",
          "Display": "czy nasza",
          "Words": [
            {
              "Word": "czy",
              "Offset": 10956600000,
              "Duration": 1300000
            },
            {
              "Word": "nasza",
              "Offset": 10958000000,
              "Duration": 4300000
            }
          ]
        },
        {
          "Confidence": 0.9367673,
          "Lexical": "trzynastu",
          "ITN": "trzynastu",
          "MaskedITN": "trzynastu",
          "Display": "trzynastu",
          "Words": [
            {
              "Word": "trzynastu",
              "Offset": 10956600000,
              "Duration": 5700000
            }
          ]
        }
      ]
    },
    "privProperties": {
      "privKeys": [
        "SpeechServiceResponse_JsonResult"
      ],
      "privValues": {
        "Id": "b9d1d54e10e347668f2db8f6f4b1a7a3",
        "RecognitionStatus": "Success",
        "Offset": 10954600000,
        "Duration": 8000000,
        "DisplayText": "",
        "NBest": [
          {
            "Confidence": 0,
            "Lexical": "",
            "ITN": "",
            "MaskedITN": "",
            "Display": ""
          },
          {
            "Confidence": 0.9367673,
            "Lexical": "tak czy nie",
            "ITN": "tak czy nie",
            "MaskedITN": "tak czy nie",
            "Display": "tak czy nie",
            "Words": [
              {
                "Word": "tak",
                "Offset": 10954800000,
                "Duration": 1900000
              },
              {
                "Word": "czy",
                "Offset": 10956800000,
                "Duration": 4900000
              },
              {
                "Word": "nie",
                "Offset": 10961800000,
                "Duration": 500000
              }
            ]
          },
          {
            "Confidence": 0.9367673,
            "Lexical": "tak trzymać",
            "ITN": "tak trzymać",
            "MaskedITN": "tak trzymać",
            "Display": "tak trzymać",
            "Words": [
              {
                "Word": "tak",
                "Offset": 10954800000,
                "Duration": 1900000
              },
              {
                "Word": "trzymać",
                "Offset": 10956800000,
                "Duration": 5500000
              }
            ]
          },
          {
            "Confidence": 0.9367673,
            "Lexical": "czy nasza",
            "ITN": "czy nasza",
            "MaskedITN": "czy nasza",
            "Display": "czy nasza",
            "Words": [
              {
                "Word": "czy",
                "Offset": 10956600000,
                "Duration": 1300000
              },
              {
                "Word": "nasza",
                "Offset": 10958000000,
                "Duration": 4300000
              }
            ]
          },
          {
            "Confidence": 0.9367673,
            "Lexical": "trzynastu",
            "ITN": "trzynastu",
            "MaskedITN": "trzynastu",
            "Display": "trzynastu",
            "Words": [
              {
                "Word": "trzynastu",
                "Offset": 10956600000,
                "Duration": 5700000
              }
            ]
          }
        ]
      }
    }
}
```

As you can see, privOffset is different from the Offset in privJson. Also, the first word's offset in Words in NBest sometimes matches privOffset and sometimes matches the Offset in privJson.

Another question I have is about privText. What is the difference between privText, DisplayText in privJson, and Display in NBest? In the example above privText and DisplayText are empty, as is the first NBest entry, but the next items in the NBest array are not empty.

Also, why is the confidence the same in every NBest item?

This example is not an isolated case; something is definitely off. Here again is a link to the full Azure response log (prettified) and the audio file: https://we.tl/t-mCL5A4nPg4

@KarolScibior (Author)

@glharper bumping last comment

@KarolScibior (Author)

@glharper @dargilco This is a really serious issue, can you look into it please?

@dargilco (Member)

Reopening, clearing owner for triage

@KarolScibior (Author)

Hello, any news regarding this issue? @dargilco

@glharper (Member) commented Apr 1, 2022

@KarolScibior Please re-upload the file used to reproduce this, and I'll take a look.

@KarolScibior (Author)

Link to audio file: https://we.tl/t-KdKweEF6ej

Recognition language is pl-PL

@glharper (Member) commented Apr 4, 2022

@KarolScibior, we have a batch transcription API that is the intended solution for long-form transcription use cases like this. The docs for that service are here.
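
For reference, creating a batch transcription job is a single REST call; a minimal sketch against the v3.0 endpoint (region, key, and the content URL are placeholders, and global fetch assumes Node 18+):

```js
// Sketch: create a batch transcription job (Speech to text REST API v3.0).
// `region`, `key`, and the audio URL below are placeholders.
const response = await fetch(
  `https://${region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions`,
  {
    method: 'POST',
    headers: {
      'Ocp-Apim-Subscription-Key': key,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      displayName: 'long-form transcription',
      locale: 'pl-PL',
      contentUrls: ['https://example.com/audio.wav'], // SAS URL to the audio blob
      properties: { wordLevelTimestampsEnabled: true },
    }),
  }
)
console.log(response.status, await response.json()) // poll the returned job URL until done
```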

Answers to your questions below.

> Hi, I think there is still a problem with Offsets in long transcriptions (the language is Polish). I have an audio file that is 2.5 hours long and the offsets diverge at some point. Here is an example:
> [...]
> As you can see, privOffset is different from the Offset in privJson.

For Recognized results (where result.reason === ResultReason.RecognizedSpeech (3)), this is as intended. The Offset in the privJson should be ignored: it is the offset reported by the service, which doesn't keep track of how long the current turn has been running. (The word-level offsets in the JSON are correct, as that was the original fix for this issue.) Please use privOffset, which is exposed via the result.offset property.
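
In practice, a consumer of final results would read the phrase offset from result.offset and take the word offsets straight from the detailed JSON; a minimal sketch along those lines:

```js
// Sketch of consuming final results per the guidance above: use result.offset
// for the phrase, and the Words[] offsets from the detailed JSON as-is,
// ignoring the JSON's top-level Offset.
recognizer.recognized = (s, e) => {
  if (e.result.reason !== sdk.ResultReason.RecognizedSpeech) return
  console.log(`phrase offset (stream-relative): ${e.result.offset}`)
  const detailed = JSON.parse(
    e.result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult)
  )
  const best = (detailed.NBest || []).find(n => Array.isArray(n.Words))
  for (const w of best ? best.Words : []) {
    console.log(`${w.Word}: offset=${w.Offset} duration=${w.Duration}`) // 100 ns ticks
  }
}
```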

> Also, the first word's offset in Words in NBest sometimes matches privOffset and sometimes matches the Offset in privJson.

I'm not seeing that above... Is that happening where result.reason === ResultReason.RecognizedSpeech (3)?

> Another question I have is about privText. What is the difference between privText, DisplayText in privJson, and Display in NBest?

For simple recognition results, privText will be the DisplayText in privJson.
For detailed recognition results, privText will be the Display of the first NBest element.
The service does often seem to send back an empty first NBest element with an empty string as the Display value, and I don't know why. That seems wrong.

> In the example above privText and DisplayText are empty, as is the first NBest entry, but the next items in the NBest array are not empty.

I can add a workaround in JS where the first element of NBest is ignored if its Display field is empty, but this seems like incorrect service behavior, and thus a somewhat fragile fix.
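
Until such a workaround ships, callers can apply the same filtering client-side; a small sketch, assuming `detailed` is the parsed result JSON:

```js
// Skip NBest entries whose Display text is empty before picking a hypothesis.
const usable = (detailed.NBest || []).filter(n => n.Display && n.Display.length > 0)
const best = usable[0] // first non-empty hypothesis
```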

> Also, why is the confidence the same in every NBest item?

Again, that's a service issue.

@glharper added the pending close (Ready for closure pending follow-up or prolonged inactivity) label and removed the in review (Acknowledged and being looked at now) and bug (Something isn't working) labels May 10, 2022