jirkle edited this page Feb 19, 2017 · 6 revisions

Welcome to the ErrCorp wiki!

## ErrCorp

ErrCorp is a tool for the automated generation of error corpora from a Wikipedia dump.

How it works

It takes a wiki dump or a page name and processes it page by page. During processing, it compares the content of every two adjacent revisions and collects the sentences that are unique to the older and to the newer revision. ErrCorp then links each old sentence to its best-matching new sentence, and finally each of these matches is classified as one of the following error types:

  • Punctuation - the old and new sentences are identical once punctuation is removed
  • Word order - the old and new sentences have the same bag of words
  • Typo - the revision comment matches a predefined set of words (the typoFilter regex); the typos themselves are then extracted
  • Edit - the revision comment matches a predefined set of words (the editFilter regex)
  • Other - all remaining, unclassified errors
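
The steps above can be illustrated with a minimal Python sketch. It is not ErrCorp's actual code: the function names, the two regex patterns standing in for typoFilter and editFilter, the use of `difflib` as the similarity measure, and the order in which the categories are checked are all assumptions made for illustration.

```python
import re
import string
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical stand-ins for the typoFilter and editFilter regexes;
# the real patterns are not documented on this page.
TYPO_FILTER = re.compile(r"\b(typo|spelling)\b", re.IGNORECASE)
EDIT_FILTER = re.compile(r"\b(edit|copyedit|rewrite)\b", re.IGNORECASE)

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence):
    """Remove ASCII punctuation and normalise whitespace."""
    return " ".join(sentence.translate(_PUNCT_TABLE).split())


def unique_sentences(old_rev, new_rev):
    """Return the sentences found only in the old and only in the new revision."""
    old_set, new_set = set(old_rev), set(new_rev)
    return ([s for s in old_rev if s not in new_set],
            [s for s in new_rev if s not in old_set])


def best_match(old_sentence, new_sentences):
    """Pick the new sentence most similar to old_sentence (assumed metric: difflib ratio)."""
    return max(new_sentences,
               key=lambda new: SequenceMatcher(None, old_sentence, new).ratio(),
               default=None)


def classify(old_sentence, new_sentence, comment):
    """Assign one of the five error types; the check order is an assumption."""
    old_plain = strip_punctuation(old_sentence)
    new_plain = strip_punctuation(new_sentence)
    if old_plain == new_plain:
        return "punctuation"
    if Counter(old_plain.split()) == Counter(new_plain.split()):
        return "word order"
    if TYPO_FILTER.search(comment):
        return "typo"
    if EDIT_FILTER.search(comment):
        return "edit"
    return "other"
```

For example, given two adjacent revisions that differ only in a final full stop, `unique_sentences` isolates the changed pair, `best_match` links them, and `classify` reports the match as a punctuation error. Extracting the individual typos from a "typo" match is left out of this sketch.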
