Simple extension of WikiExtractor (https://github.com/attardi/wikiextractor)
Type Name Latest commit message Commit time
.gitignore initial commit Oct 14, 2016
ChangeLog initial commit Oct 14, 2016
LICENSE initial commit Oct 14, 2016
README.md Add example command for WikiExtractor.py Dec 23, 2016
To_the_one_text.py initial commit Oct 14, 2016
WikiExtractor.py initial commit Oct 14, 2016
cirrus-extract.py initial commit Oct 14, 2016
extractPage.py initial commit Oct 14, 2016
setup.py initial commit Oct 14, 2016
tests.py initial commit Oct 14, 2016
tox.ini initial commit Oct 14, 2016

README.md

WikiExtractor_To_the_one_text

A simple extension of the Python script that extracts and cleans text from a Wikipedia database dump. Most of the code is from WikiExtractor.

Installation

(sudo) python setup.py install

Usage

python WikiExtractor.py Wiki_dump.xml -options

Example: python WikiExtractor.py enwiki-latest-pages-articles.xml -b 500K -o extracted

For detailed options, see WikiExtractor (https://github.com/attardi/wikiextractor).

python To_the_one_text.py Input_directory Name_of_the_single_output_file
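The merge step can be pictured roughly as follows. This is a minimal sketch, not the actual contents of To_the_one_text.py: the function name `merge_extracted` and the `strip_doc_tags` switch are hypothetical. It assumes the input directory holds WikiExtractor output (subdirectories such as AA containing files such as wiki_00, with each article wrapped in `<doc ...>` and `</doc>` lines).

```python
import os
import sys


def merge_extracted(input_dir, output_path, strip_doc_tags=True):
    """Concatenate every file under input_dir into one text file.

    Hypothetical helper for illustration; returns the number of files merged.
    """
    merged = 0
    out_abs = os.path.abspath(output_path)
    with open(output_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(input_dir):
            dirs.sort()  # visit subdirectories (AA, AB, ...) in a stable order
            for name in sorted(files):
                path = os.path.join(root, name)
                if os.path.abspath(path) == out_abs:
                    continue  # never read the output file back into itself
                with open(path, encoding="utf-8") as src:
                    for line in src:
                        # Assumption: drop the <doc ...> / </doc> wrapper lines
                        if strip_doc_tags and line.startswith(("<doc ", "</doc>")):
                            continue
                        out.write(line)
                merged += 1
    return merged


if __name__ == "__main__":
    # Mirrors the command line above: Input_directory, then output file name
    if len(sys.argv) == 3:
        count = merge_extracted(sys.argv[1], sys.argv[2])
        print("merged %d files into %s" % (count, sys.argv[2]))
```

With the example above, the input directory would be `extracted`. Sorting the directory and file names keeps the order of the merged text reproducible between runs.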
