Home
jirkle edited this page Feb 19, 2017
·
6 revisions
Welcome to the ErrCorp wiki!
## ErrCorp

ErrCorp is a tool for the automated generation of error corpora from Wikipedia dumps.
It takes a wiki dump or a page name and processes it page by page. During processing, it compares the content of every two adjacent revisions and collects the sentences that are unique to the older revision and those unique to the newer one. ErrCorp then links each old sentence to its best-matching new sentence, and finally each match is classified as one of the following error types:
- Punctuation - the old and new sentences are identical once punctuation is removed
- Word order - the old and new sentences have the same bag of words
- Typo - the revision comment matches a predefined set of words (regex `typoFilter`); the typos themselves are then extracted
- Edit - the revision comment matches a predefined set of words (regex `editFilter`)
- Other - all remaining, unclassified errors
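
The steps above can be illustrated with a minimal Python sketch. It is not ErrCorp's actual code: the function names, the two regex patterns standing in for `typoFilter` and `editFilter`, the use of `difflib` for similarity, and the order in which the checks are applied are all assumptions made for illustration. Revisions are assumed to be already split into lists of sentences.

```python
import re
import string
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical stand-ins for ErrCorp's typoFilter and editFilter regexes.
TYPO_FILTER = re.compile(r"\b(typo|spelling)\b", re.IGNORECASE)
EDIT_FILTER = re.compile(r"\b(copyedit|grammar|rewording)\b", re.IGNORECASE)

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence):
    """Remove ASCII punctuation and normalize whitespace."""
    return " ".join(sentence.translate(_PUNCT_TABLE).split())


def unique_sentences(old_rev, new_rev):
    """Return the sentences found only in the old and only in the new revision."""
    old_set, new_set = set(old_rev), set(new_rev)
    only_old = [s for s in old_rev if s not in new_set]
    only_new = [s for s in new_rev if s not in old_set]
    return only_old, only_new


def best_match(old_sentence, new_sentences):
    """Pick the new sentence most similar to old_sentence (assumed metric: difflib ratio)."""
    return max(
        new_sentences,
        key=lambda s: SequenceMatcher(None, old_sentence, s).ratio(),
        default=None,
    )


def classify(old_sentence, new_sentence, comment):
    """Assign one error type to a matched sentence pair (check order is an assumption)."""
    old_plain = strip_punctuation(old_sentence)
    new_plain = strip_punctuation(new_sentence)
    if old_plain == new_plain:
        return "punctuation"
    if Counter(old_plain.split()) == Counter(new_plain.split()):
        return "word order"
    if TYPO_FILTER.search(comment):
        return "typo"
    if EDIT_FILTER.search(comment):
        return "edit"
    return "other"
```

A pair of adjacent revisions would then be handled by calling `unique_sentences`, matching each old sentence with `best_match`, and passing each pair together with the revision comment to `classify`. Comparing bags of words with `Counter` keeps repeated words significant, so a sentence that drops a duplicate word is not mistaken for a reordering.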