Transcript offsets #25

maxhawkins · 2015-11-24T17:09:37Z

Calculate transcript offsets for all tokens and use those to reconstruct the text in-browser. This begins to address #11.
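
For readers following along, the idea can be sketched roughly like this. This is a minimal illustration, not the code in this PR: `TOKEN_RE`, `tokenize_with_offsets`, and `reconstruct` are hypothetical names, and the token pattern is an assumption. Each token carries its start and end offset into the original transcript, so the viewer can rebuild the full text, including punctuation and whitespace, from the untouched gaps between tokens.

```python
import re

# Hypothetical token pattern (an assumption, not the PR's actual rule):
# runs of word characters or apostrophes. Hyphens and other punctuation
# are not part of a token, so "well-known" yields two tokens.
TOKEN_RE = re.compile(r"[\w']+")


def tokenize_with_offsets(text):
    """Return (token, start, end) tuples; offsets index into `text`."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def reconstruct(text, tokens):
    """Rebuild the original text from token offsets, keeping the gaps."""
    pieces = []
    cursor = 0
    for _, start, end in tokens:
        pieces.append(text[cursor:start])  # punctuation/whitespace before the token
        pieces.append(text[start:end])     # the token itself
        cursor = end
    pieces.append(text[cursor:])           # whatever follows the last token
    return "".join(pieces)


transcript = "A well-known line: April is the cruellest month."
tokens = tokenize_with_offsets(transcript)
```

Because only the offsets are stored, the text sent to the aligner can be normalized freely while the displayed transcript stays byte-for-byte what the user typed.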

strob · 2015-11-24T18:35:41Z

Thanks for starting to work on this! This definitely seems like the path forward for addressing hyphenation and German compound words. (Are there other cases where this will make a substantive difference?)
Does this already fix hyphenation alignment? If not, I'd rather wait to merge until it adds that functionality.

maxhawkins · 2015-11-24T19:31:33Z

Yeah, good point. This fixes the hyphenation stuff. It also preserves punctuation, so you can make awesome ASCII art in your transcripts.

★░░░░░░░░░░░████░░░░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░███░██░░░░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░██░░░█░░░░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░██░░░██░░░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░░██░░░███░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░░░██░░░░██░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░░░██░░░░░███░░░░░░░░░░░░░░★ 
★░░░░░░░░░░░░██░░░░░░██░░░░░░░░░░░░░★ 
★░░░░░░░███████░░░░░░░██░░░░░░░░░░░░★ 
★░░░░█████░░░░░░░░░░░░░░███░██░░░░░░★ 
★░░░██░░░░░████░░░░░░░░░░██████░░░░░★ 
★░░░██░░████░░███░░░░░░░░░░░░░██░░░░★ 
★░░░██░░░░░░░░███░░░░░░░░░░░░░██░░░░★ 
★░░░░██████████░███░░░░░░░░░░░██░░░░★ 
★░░░░██░░░░░░░░████░░░░░░░░░░░██░░░░★ 
★░░░░███████████░░██░░░░░░░░░░██░░░░★ 
★░░░░░░██░░░░░░░████░░░░░██████░░░░░★ 
★░░░░░░██████████░██░░░░███░██░░░░░░★ 
★░░░░░░░░░██░░░░░████░███░░░░░░░░░░░★ 
★░░░░░░░░░█████████████░░░░░░░░░░░░░★ 
★░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░★

this does a better job of preserving whitespace and other things we want to filter out before feeding into Kaldi.

closes #11

strob · 2015-11-24T19:56:25Z

Getting there!

Unfortunately, the click/highlight behavior on view_alignment.html is behaving very poorly.

Here's a screenshot of the sort of thing I'm seeing (throughout a transcript)--let me know if you're able to reproduce or if you need a more elaborate bug report:

maxhawkins · 2015-11-24T20:04:05Z

That's a bug with the offset calculation. Can you send me the transcript text?

strob · 2015-11-24T20:06:33Z

Here's the transcript text and the alignment JSON:

wasteland.txt
wastleland-align.json.txt

maxhawkins · 2015-11-24T20:27:08Z

Ugh. It's Python's UTF-8 handling. Working on it...

maxhawkins · 2015-11-24T21:28:23Z

OK, that should fix it. Please give this a try.

This video on Unicode in Python is required watching. So much pain:
http://nedbatchelder.com/text/unipain.html
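
A small illustration of the kind of mismatch involved, assuming the problem was byte offsets versus code point offsets (the "offsets as unicode codepoints" commit suggests this, but the sample string below is made up and is not from the transcript in question):

```python
text = "café au lait, s'il vous plaît"
word = "vous"

# Offset counted in Unicode code points (how a Python 3 str is indexed).
cp_start = text.index(word)

# Offset counted in UTF-8 bytes. "é" takes two bytes, so every offset
# after it is shifted by one relative to the code point offset.
byte_start = text.encode("utf-8").index(word.encode("utf-8"))

# Slicing the str with the code point offset recovers the word;
# slicing it with the byte offset lands one character too far.
by_code_point = text[cp_start:cp_start + len(word)]
by_byte_offset = text[byte_start:byte_start + len(word)]
```

The drift grows with every multi-byte character earlier in the text, which would match highlights being off throughout a transcript. On the browser side, JavaScript string indices count UTF-16 code units, which coincide with code points for characters in the Basic Multilingual Plane.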

maxhawkins · 2015-11-24T22:07:25Z

In case you tried this in the last few minutes, please pull again. There was a bug, but the latest version should work for your file.

strob · 2015-11-24T22:45:11Z

Fabulous.

maxhawkins added 7 commits November 24, 2015 20:52

output text offsets with alignment

91414c5

this does a better job of preserving whitespace and other things we want to filter out before feeding into Kaldi.

remove unused align_from_kaldi function

130e89f

perf: only update DOM when the word changes

49d34f7

serve.py: text from transcript not tokens

0839d07

tokenization: split hyphenated words and ignore punctuation

c23c3cb

closes #11

update golden master

26d7e0f

metasentence cleanup

a23b7b5

maxhawkins added 2 commits November 24, 2015 22:26

offsets as unicode codepoints

920d42b

more tests

5060f8b

keep raw sentence as unicode (fixes crash)

cf3ca57

strob added a commit that referenced this pull request Nov 24, 2015

Merge pull request #25 from maxhawkins/text_offsets

43b0b31

Transcript offsets

strob merged commit 43b0b31 into lowerquality:master Nov 24, 2015

maxhawkins deleted the text_offsets branch November 24, 2015 22:51
