Compare/review PDF vs OCR output #41

Open
4 of 5 tasks
interrogator opened this issue May 20, 2020 · 1 comment
Labels
law-hist needed for law history demo

Comments

@interrogator
Owner

interrogator commented May 20, 2020

The law history project is just one of many possible projects where the input data is the output of an OCR process.

For such corpora, buzzword should ideally provide an interface that displays the original PDF beside its plaintext OCR output. Users should be able to submit changes (i.e. corrections) and/or add metadata tags to the text. For the implementation, the text pane on the right should use a good high-level library (currently martor). Needed features:

  • Corpus model updated to handle a parallel 'PDF corpus'
  • The compare/<slug> endpoint, possibly with page/pages query params for starting at a certain place
  • Navigation between pages. Essentially (<<first <prev next> last>>) [ goto ] is good enough for now.
  • Admin interface for viewing/accepting submitted corrections
  • Email service to tell users the status of their changes/tell admins that there are new suggestions
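
A minimal sketch of the page handling behind the navigation item above, written as plain Python so it does not depend on the actual buzzword views. It assumes a single `page` query parameter, 1-based page numbers, and at least one page; the names `parse_page`, `build_nav`, and `PageNav` are hypothetical and not taken from the codebase.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs


@dataclass
class PageNav:
    """Targets for the (<<first <prev next> last>>) controls; None hides a control."""
    first: Optional[int]
    prev: Optional[int]
    current: int
    next: Optional[int]
    last: Optional[int]


def parse_page(query_string: str, total_pages: int) -> int:
    """Read ?page=N from a raw query string, clamped to 1..total_pages.

    Missing or non-numeric values fall back to page 1 (an assumption).
    """
    raw = parse_qs(query_string).get("page", ["1"])[0]
    try:
        page = int(raw)
    except ValueError:
        page = 1
    return max(1, min(page, total_pages))


def build_nav(current: int, total_pages: int) -> PageNav:
    """Work out which navigation targets exist for the current page."""
    has_prev = current > 1
    has_next = current < total_pages
    return PageNav(
        first=1 if has_prev else None,
        prev=current - 1 if has_prev else None,
        current=current,
        next=current + 1 if has_next else None,
        last=total_pages if has_next else None,
    )
```

A view for compare/<slug> could call `parse_page` on the request's query string and pass the result of `build_nav` to the template, which would also cover the [ goto ] box, since any out-of-range value is clamped to a valid page.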

This should be built with the knowledge that the explorer and compare interfaces will eventually be linked via #40.

This was referenced May 20, 2020
@interrogator interrogator added the law-hist needed for law history demo label May 20, 2020
@interrogator
Owner Author

No email service yet, but everything else is done.
