
Data model for transcripts #71

Open
3 tasks
Frando opened this issue Dec 2, 2021 · 0 comments

Frando commented Dec 2, 2021

Currently the transcript is saved as JSON with the following structure:

{
  "parts": [
    { "word": "foo", "conf": 0.382, "start": 1.323, "end": 1.55322 }
  ]
}

Missing things:

  • If we continue to store the full transcript results, we need an optional per-part field "suffix" to store the punctuation that follows the word.
  • For Elasticsearch indexing we already convert this format into a text-only form for the Elasticsearch payload delimiter. The format is `"foo|start,end,conf bar|start,end,conf"`. This is much smaller, so we could possibly switch to this format for storage as well.
  • We should also define an outer data model for the transcript to store metadata. Fields: engine, model, modelVersion (and processingTime, averageConfidence, createdDate, which can be autogenerated).
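
To make the three points above concrete, here is a rough sketch in Python of what the data model could look like. It is only an illustration under assumptions: the class and function names (`Part`, `Transcript`, `load_parts`, `to_delimited`, `from_delimited`) are hypothetical, the field names are taken from the list above, and the delimited form does not carry the proposed "suffix" field or handle words that contain spaces.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Part:
    """One recognized word, mirroring the current JSON structure."""
    word: str
    conf: float
    start: float
    end: float
    suffix: Optional[str] = None  # proposed: punctuation that follows the word


@dataclass
class Transcript:
    """Hypothetical outer model holding metadata plus the parts."""
    engine: str
    model: str
    modelVersion: str
    parts: List[Part] = field(default_factory=list)
    processingTime: Optional[float] = None
    # Autogenerated at creation time (UTC, ISO 8601).
    createdDate: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def averageConfidence(self) -> float:
        # Derived from the parts rather than stored.
        if not self.parts:
            return 0.0
        return sum(p.conf for p in self.parts) / len(self.parts)


def load_parts(raw: str) -> List[Part]:
    """Parse the current JSON storage format into Part objects."""
    return [Part(**p) for p in json.loads(raw)["parts"]]


def to_delimited(parts: List[Part]) -> str:
    """Serialize to the compact "word|start,end,conf" form."""
    return " ".join(f"{p.word}|{p.start},{p.end},{p.conf}" for p in parts)


def from_delimited(text: str) -> List[Part]:
    """Parse the compact form back into Part objects (suffix is lost)."""
    parts = []
    for token in text.split():
        # Split on the last "|" so a "|" inside the word survives.
        word, payload = token.rsplit("|", 1)
        start, end, conf = (float(x) for x in payload.split(","))
        parts.append(Part(word=word, conf=conf, start=start, end=end))
    return parts
```

With the example part from above, `to_delimited(load_parts(raw))` would give `foo|1.323,1.55322,0.382`. If we do switch storage to the compact form, we would still need to decide where the suffix goes, e.g. as a fourth payload value.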
@Frando Frando added this to To do in OAS - Development Dec 2, 2021