Data model for transcripts #71

Frando · 2021-12-02T14:41:30Z

Currently the transcript is saved as JSON with the following structure:

{
  "parts": [
    { "word": "foo", "conf": 0.382, "start": 1.323, "end": 1.55322 }
  ]
}

Missing things:

If we continue to store the full transcript results, we need an optional per-part field "suffix" to store the punctuation that follows the word.
For Elasticsearch indexing we already convert this format into a text-only form for the Elasticsearch payload delimiter. The format is `"foo|start,end,conf bar|start,end,conf"`. This is much smaller. We could switch to this format for storage as well.
We should also define an outer data model for the transcript to store meta information. Fields: engine, model, modelVersion (and processingTime, averageConfidence, createdDate, which can be autogenerated).
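
To make the three points above concrete, here is a rough sketch in Python. It is not the project's actual code: the names `Part`, `Transcript`, `to_payload_text` and `from_payload_text` are hypothetical, and it assumes words contain no whitespace. How "suffix" would be carried in the compact form is left open, since that is not decided above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Part:
    word: str
    conf: float
    start: float
    end: float
    # Proposed optional field: punctuation that follows the word.
    suffix: Optional[str] = None


@dataclass
class Transcript:
    # Hypothetical outer model; field names are taken from the list above.
    engine: str
    model: str
    modelVersion: str
    parts: List[Part] = field(default_factory=list)

    @property
    def averageConfidence(self) -> float:
        # Derived from the parts instead of being stored.
        if not self.parts:
            return 0.0
        return sum(p.conf for p in self.parts) / len(self.parts)


def to_payload_text(parts: List[Part]) -> str:
    """Encode parts as "word|start,end,conf word|start,end,conf"."""
    return " ".join(f"{p.word}|{p.start},{p.end},{p.conf}" for p in parts)


def from_payload_text(text: str) -> List[Part]:
    """Decode the compact form back into parts (suffix is not carried)."""
    parts = []
    for token in text.split():
        # Split on the last "|" so a "|" inside the word survives.
        word, _, payload = token.rpartition("|")
        start, end, conf = (float(x) for x in payload.split(","))
        parts.append(Part(word=word, conf=conf, start=start, end=end))
    return parts
```

With the example part from above, `to_payload_text([Part("foo", 0.382, 1.323, 1.55322)])` gives `foo|1.323,1.55322,0.382`, and `from_payload_text` turns that back into an equal `Part`.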

Frando added this to To do in OAS - Development Dec 2, 2021