[READ-ONLY] A word extractor for Wikipedia articles.

jamesponddotco/wikiextract

wikiextract

wikiextract is a word extractor for Wikipedia articles. It extracts words longer than four characters from a given Wikipedia page, or from a list of pages, and saves them to a file that you can later use as the word source for generating diceware passwords.

Installation

From source

First install the dependencies:

  • Go 1.22 or above.
  • make.
  • scdoc.

Switch to the latest stable tag, v1.0.0, then compile and install:

git checkout v1.0.0
make
sudo make install

Usage

$ wikiextract --help
NAME:
   wikiextract - a simple word extractor for Wikipedia articles

USAGE:
   wikiextract [global options] 

VERSION:
   1.0.0

GLOBAL OPTIONS:
   --input-url value, -u value [ --input-url value, -u value ]  the URL of the Wikipedia page
   --input-file value, -f value                                 a file containing a list of URLs
   --output value, -o value                                     the path to the output file
   --help, -h                                                   show help
   --version, -v                                                print the version

$ wikiextract -u 'https://en.wikipedia.org/wiki/Wikipedia' -o 'output.txt'

See wikiextract(1) after installing for more information.

Contributing

Anyone can help make wikiextract better. Send patches on the mailing list and report bugs on the issue tracker.

You must sign off your work using git commit --signoff. See the Linux kernel's Developer's Certificate of Origin for details.

All contributions are made under the GPL-2.0 license.

Resources

The following resources are available:


Released under the GPL-2.0 license.