
Welcome to the ErrCorp wiki!

Usage

--help, -h Prints help
--lang, -l Language of the processed input; required; one of [czech|english]
--separator, -s Separator character; default [;]
--robots, -r Flag: if present, revisions made by bots are included
--nesting, -n Flag: if present, nesting of errors is allowed
--mute, -m Flag: if present, the script reports only the currently processed page, the estimated time remaining, and corpus statistics
--paths, -p Local paths to database dumps
--dumpUrls, -d Remote URLs of database dumps
--articleName, -a Names of articles to be downloaded through the MediaWiki action API
--output, -o Output path
--outputFormat, -f Output format; one of [txt|se]
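
A hypothetical invocation, assuming main.py is the entry point, that processes a local Czech dump and writes a plain-text corpus (the dump and output paths are illustrative):

```
python main.py --lang czech --paths cswiki-pages-meta-history.xml --output corpus.txt --outputFormat txt
```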

Algorithm description

The script is divided into several files. Input handling is implemented in main.py and WikiDownload.py, page processing in PageProcessor.py, error extraction in ErrorExtractor.py, post-processing in PostProcessor.py and ErrorClassifier.py, and, finally, output is handled by Export.py.
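
As a rough illustration of how these stages fit together, here is a minimal, runnable sketch of the pipeline. Every function below is a hypothetical stand-in for the corresponding file, not its actual API:

```python
# Hypothetical stand-ins for the pipeline stages described above;
# none of these names come from the real ErrCorp modules.

def load_pages(path):
    """Stand-in for main.py / WikiDownload.py: yield (title, revisions)."""
    yield "Example page", ["Ths is a sentense.", "This is a sentence."]

def process_page(revisions):
    """Stand-in for PageProcessor.py: pair each revision with its successor."""
    return list(zip(revisions, revisions[1:]))

def extract_errors(pairs):
    """Stand-in for ErrorExtractor.py: keep only pairs that actually differ."""
    return [(old, new) for old, new in pairs if old != new]

def classify(errors):
    """Stand-in for PostProcessor.py / ErrorClassifier.py: attach a type label."""
    return [("unclassified", old, new) for old, new in errors]

def export(records, out_path):
    """Stand-in for Export.py: write one annotated pair per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for label, old, new in records:
            f.write(f"{label}\t{old}\t{new}\n")

for title, revisions in load_pages("dump.xml"):
    export(classify(extract_errors(process_page(revisions))), "corpus.txt")
```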

How it works

ErrCorp takes a Wikipedia dump or an article name and processes it page by page. During processing, it compares the content of every two adjacent revisions and collects the sentences that are unique to the older and to the newer revision. ErrCorp then links each old sentence to the best-matching new sentence. Because a single sentence may be edited several times over multiple revisions, evolution resolution is run after all sentence pairs have been extracted. The script then compares each older version of a sentence against the newest version and extracts errors at the character level. Afterward, these minimal changes are post-processed: expanded to whole words, classified, and annotated as errors in the newest version of the sentence.
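
The linking and character-level diffing steps can be sketched with Python's standard difflib. This is only an assumed stand-in for ErrCorp's actual comparison code; the similarity threshold and helper names are hypothetical:

```python
import difflib

def link_sentences(old_sents, new_sents, threshold=0.6):
    """Link each old sentence to the best-matching new sentence."""
    pairs = []
    for old in old_sents:
        best = max(
            new_sents,
            key=lambda new: difflib.SequenceMatcher(None, old, new).ratio(),
            default=None,
        )
        if best and difflib.SequenceMatcher(None, old, best).ratio() >= threshold:
            pairs.append((old, best))
    return pairs

def char_errors(old, new):
    """Extract the minimal character-level changes between two sentence versions."""
    sm = difflib.SequenceMatcher(None, old, new)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

old_rev = ["Ths is a sentense with errors."]
new_rev = ["This is a sentence with errors."]
for old, new in link_sentences(old_rev, new_rev):
    print(char_errors(old, new))
# [('insert', '', 'i'), ('replace', 's', 'c')]
```

In this sketch, linking by similarity ratio rather than by position keeps the pairing stable when sentences are inserted, deleted, or reordered between revisions.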

Input

Page processor

Error extractor

Post processor

Working on it :)
