Transcript offsets #25

Merged
strob merged 10 commits into lowerquality:master from the text_offsets branch on Nov 24, 2015

Conversation

maxhawkins
Contributor

Calculate transcript offsets for all tokens and use those to reconstruct the text in-browser. This begins to address #11.
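
A minimal sketch of the idea described above, not the code from this pull request: record a start and end character offset for every token, then rebuild the text by interleaving the token spans with the untouched gaps between them. The function names `tokenize_with_offsets` and `reconstruct` and the word pattern are assumptions made for illustration.

```python
import re

# Hypothetical word pattern: runs of word characters, optionally joined
# by internal hyphens or apostrophes (e.g. "well-known", "isn't").
WORD_PATTERN = re.compile(r"\w+(?:[-']\w+)*")


def tokenize_with_offsets(transcript):
    """Return (word, start, end) tuples with character offsets into transcript."""
    return [(m.group(), m.start(), m.end())
            for m in WORD_PATTERN.finditer(transcript)]


def reconstruct(transcript, tokens):
    """Rebuild the original text from token spans plus the gaps between them."""
    pieces = []
    cursor = 0
    for _word, start, end in tokens:
        pieces.append(transcript[cursor:start])  # punctuation and whitespace
        pieces.append(transcript[start:end])     # the token itself
        cursor = end
    pieces.append(transcript[cursor:])           # trailing text after last token
    return "".join(pieces)


if __name__ == "__main__":
    sample = "A well-known poem, isn't it?"
    for token in tokenize_with_offsets(sample):
        print(token)
    print(reconstruct(sample, tokenize_with_offsets(sample)) == sample)
```

Because the gaps are copied verbatim from the source text, punctuation and hyphenated words survive the round trip, which is the property the rest of this thread relies on.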

@strob
Contributor

strob commented Nov 24, 2015

Thanks for starting to work on this! This definitely seems like the path forward for addressing hyphenation and German compound words. (Are there other cases where this will make a substantive difference?)
Does this already fix hyphenation alignment? If not, I'd rather wait to merge until it adds that functionality.

@maxhawkins
Contributor Author

Yeah, good point. This fixed the hyphenation stuff. It also preserves punctuation, so you can make awesome ASCII art in your transcripts.

★░░░░░░░░░░░████░░░░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░███░██░░░░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░██░░░█░░░░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░██░░░██░░░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░░██░░░███░░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░░░██░░░░██░░░░░░░░░░░░░░░░★ 
★░░░░░░░░░░░██░░░░░███░░░░░░░░░░░░░░★ 
★░░░░░░░░░░░░██░░░░░░██░░░░░░░░░░░░░★ 
★░░░░░░░███████░░░░░░░██░░░░░░░░░░░░★ 
★░░░░█████░░░░░░░░░░░░░░███░██░░░░░░★ 
★░░░██░░░░░████░░░░░░░░░░██████░░░░░★ 
★░░░██░░████░░███░░░░░░░░░░░░░██░░░░★ 
★░░░██░░░░░░░░███░░░░░░░░░░░░░██░░░░★ 
★░░░░██████████░███░░░░░░░░░░░██░░░░★ 
★░░░░██░░░░░░░░████░░░░░░░░░░░██░░░░★ 
★░░░░███████████░░██░░░░░░░░░░██░░░░★ 
★░░░░░░██░░░░░░░████░░░░░██████░░░░░★ 
★░░░░░░██████████░██░░░░███░██░░░░░░★ 
★░░░░░░░░░██░░░░░████░███░░░░░░░░░░░★ 
★░░░░░░░░░█████████████░░░░░░░░░░░░░★ 
★░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░★

@strob
Contributor

strob commented Nov 24, 2015

Getting there!

Unfortunately, the click/highlight behavior on view_alignment.html is working very poorly.

Here's a screenshot of the sort of thing I'm seeing (throughout a transcript)--let me know if you're able to reproduce or if you need a more elaborate bug report:

[Screenshot: 2015-11-24, 11:55:33 AM]

@maxhawkins
Contributor Author

That's a bug with the offset calculation. Can you send me the transcript text?

@strob
Contributor

strob commented Nov 24, 2015

Here's the transcript text and the alignment JSON:

wasteland.txt
wastleland-align.json.txt

@maxhawkins
Contributor Author

Ugh. It's Python's UTF-8 handling. Working on it...
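
The thread does not say exactly what went wrong, so the following is only an illustration of one common way UTF-8 handling breaks offsets: positions measured in encoded bytes drift away from positions measured in characters as soon as the text contains a multi-byte character. The helper name `byte_to_char_offset` and the sample line are assumptions, not taken from the pull request.

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Convert an offset into UTF-8 bytes to an offset into the decoded string.

    Decoding the prefix counts code points rather than bytes. This raises
    UnicodeDecodeError if byte_offset falls inside a multi-byte character.
    """
    return len(data[:byte_offset].decode("utf-8"))


if __name__ == "__main__":
    text = "Öd' und leer das Meer"   # "Ö" takes two bytes in UTF-8
    data = text.encode("utf-8")

    byte_start = data.index(b"leer")  # offset counted in bytes
    char_start = byte_to_char_offset(data, byte_start)

    # The two offsets differ by one because of the two-byte "Ö".
    print(byte_start, char_start)
    print(text[char_start:char_start + 4])
```

Offsets sent to the browser need one more check: JavaScript strings are indexed by UTF-16 code units, so characters outside the Basic Multilingual Plane count as two positions there but as one in a Python 3 `str`.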

@maxhawkins
Contributor Author

OK, that should fix it. Please give this a try.

This video on Unicode in Python is required watching. So much pain:
http://nedbatchelder.com/text/unipain.html

@maxhawkins
Contributor Author

In case you tried this in the last few minutes, please pull again. There was a bug, but the latest version should work for your file.

strob added a commit that referenced this pull request Nov 24, 2015
@strob strob merged commit 43b0b31 into lowerquality:master Nov 24, 2015
@strob
Contributor

strob commented Nov 24, 2015

Fabulous.

@maxhawkins maxhawkins deleted the text_offsets branch November 24, 2015 22:51

2 participants