Transcript offsets #25
Thanks for starting to work on this! This definitely seems like the path forward for addressing hyphenation and German compound words. (Are there other cases where this will make a substantive difference?)
Yeah good point. This fixed the hyphenation stuff. It also preserves punctuation so you can make awesome ascii art in your transcripts.
★░░░░░░░░░░░████░░░░░░░░░░░░░░░░░░░░★
★░░░░░░░░░███░██░░░░░░░░░░░░░░░░░░░░★
★░░░░░░░░░██░░░█░░░░░░░░░░░░░░░░░░░░★
★░░░░░░░░░██░░░██░░░░░░░░░░░░░░░░░░░★
★░░░░░░░░░░██░░░███░░░░░░░░░░░░░░░░░★
★░░░░░░░░░░░██░░░░██░░░░░░░░░░░░░░░░★
★░░░░░░░░░░░██░░░░░███░░░░░░░░░░░░░░★
★░░░░░░░░░░░░██░░░░░░██░░░░░░░░░░░░░★
★░░░░░░░███████░░░░░░░██░░░░░░░░░░░░★
★░░░░█████░░░░░░░░░░░░░░███░██░░░░░░★
★░░░██░░░░░████░░░░░░░░░░██████░░░░░★
★░░░██░░████░░███░░░░░░░░░░░░░██░░░░★
★░░░██░░░░░░░░███░░░░░░░░░░░░░██░░░░★
★░░░░██████████░███░░░░░░░░░░░██░░░░★
★░░░░██░░░░░░░░████░░░░░░░░░░░██░░░░★
★░░░░███████████░░██░░░░░░░░░░██░░░░★
★░░░░░░██░░░░░░░████░░░░░██████░░░░░★
★░░░░░░██████████░██░░░░███░██░░░░░░★
★░░░░░░░░░██░░░░░████░███░░░░░░░░░░░★
★░░░░░░░░░█████████████░░░░░░░░░░░░░★
★░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░★
This does a better job of preserving whitespace and other things we want to filter out before feeding into Kaldi.
That's a bug with the offset calculation. Can you send me the transcript text?
Here's the transcript text and the alignment JSON:
Ugh. It's Python's UTF-8 handling. Working on it...
OK, that should fix it. Please give this a try. This video on Unicode in Python is required watching. So much pain:
In case you tried this in the last few minutes, please pull again. There was a bug, but the latest version should work for your file.
Fabulous.
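The thread does not say exactly what went wrong with the UTF-8 handling. One common way transcript offsets drift, sketched here as an assumption rather than a description of the actual fix, is mixing offsets counted in bytes with offsets counted in characters (code points). The helper names below are hypothetical and written for Python 3:

```python
def char_to_byte_offset(text, char_offset):
    """Convert a code-point offset into `text` to a UTF-8 byte offset."""
    return len(text[:char_offset].encode("utf-8"))


def byte_to_char_offset(data, byte_offset):
    """Convert a UTF-8 byte offset into `data` to a code-point offset.

    Raises UnicodeDecodeError if `byte_offset` falls in the middle of a
    multi-byte character.
    """
    return len(data[:byte_offset].decode("utf-8"))


# Illustrative transcript: "ß" and "ö" each take two bytes in UTF-8,
# so byte and character offsets diverge after the first of them.
text = "Straße ist schön"
data = text.encode("utf-8")

char_start = text.index("ist")                        # offset in code points
byte_start = char_to_byte_offset(text, char_start)    # offset in bytes
```

Slicing the `str` with the character offset and the `bytes` with the byte offset both give the same word; swapping the two gives an offset that is off by one for every multi-byte character that precedes it.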
Calculate transcript offsets for all tokens and use those to reconstruct the text in-browser. This begins to address #11.
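The idea of keeping per-token offsets so the original text can be rebuilt can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the function names and the word pattern are assumptions, and the reconstruction here is done in Python rather than in the browser.

```python
import re

# Hypothetical word pattern: runs of word characters, allowing internal
# apostrophes (so "isn't" stays one token). Hyphens and other punctuation
# are left out of the tokens but are not lost, because each token keeps
# its position in the original text.
WORD_RE = re.compile(r"\w+(?:'\w+)*")


def tokenize_with_offsets(transcript):
    """Return a list of (word, start, end) tuples, offsets in code points."""
    return [(m.group(), m.start(), m.end()) for m in WORD_RE.finditer(transcript)]


def reconstruct(transcript, tokens):
    """Rebuild the transcript by interleaving tokens with the gaps between them."""
    pieces = []
    cursor = 0
    for _word, start, end in tokens:
        pieces.append(transcript[cursor:start])  # punctuation/whitespace before the token
        pieces.append(transcript[start:end])     # the token exactly as written
        cursor = end
    pieces.append(transcript[cursor:])           # trailing text after the last token
    return "".join(pieces)


transcript = "Well-known Donaudampfschiff, isn't it?"
tokens = tokenize_with_offsets(transcript)
rebuilt = reconstruct(transcript, tokens)
```

Because only the offsets are used to slice the original string, hyphens, punctuation, and whitespace survive the round trip even though they never appear in the token list passed on for alignment.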