Wikipedia information extraction library

Infoboxer

Infoboxer is a pure-Ruby Wikipedia (and generic MediaWiki) client and parser, targeting information extraction (hence the name).

It can be useful in tasks like:

  • getting a plaintext abstract of an article (the paragraphs before the first heading);
  • getting structured data variables from a page's infobox;
  • listing a page's sections and counting the paragraphs, images and tables in them;
  • converting some huge "comparison table" to data;
  • and much, much more!

The whole idea is this: you get any Wikipedia page as a parsed tree with an obvious structure, you can navigate that tree easily, and you have a bunch of high-level helper methods, so typical information extraction tasks should be super easy, one-liners in the best cases.
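To make the first task in the list concrete, here is a minimal plain-Ruby sketch of "paragraphs before the first heading". It does not use Infoboxer at all: `Heading`, `Paragraph` and `plaintext_abstract` are hypothetical stand-ins invented for illustration, and the real library works on its own node classes.

```ruby
# Toy stand-ins for parsed nodes. These Structs are hypothetical
# and are NOT Infoboxer classes.
Heading   = Struct.new(:text)
Paragraph = Struct.new(:text)

# Returns the text of all paragraphs that come before the first heading,
# separated by blank lines.
def plaintext_abstract(nodes)
  nodes
    .take_while { |node| !node.is_a?(Heading) } # stop at the first heading
    .grep(Paragraph)                            # keep paragraphs only
    .map(&:text)
    .join("\n\n")
end

nodes = [
  Paragraph.new('First intro paragraph.'),
  Paragraph.new('Second intro paragraph.'),
  Heading.new('History'),
  Paragraph.new('A paragraph that belongs to a section.')
]

puts plaintext_abstract(nodes)
```

Everything after the `History` heading is ignored, which is the behaviour the task description asks for.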

(For those already thinking "Why should you do this, we already have DBpedia?": please read the "Reasons" page in our wiki.)

Showcase

Infoboxer.wikipedia.
  get('Breaking Bad (season 1)').
  sections('Episodes').templates(name: 'Episode table').
  fetch('episodes').templates(name: /^Episode list/).
  fetch_hashes('EpisodeNumber', 'EpisodeNumber2', 'Title', 'ShortSummary')
# => [{"EpisodeNumber"=>#<Var(EpisodeNumber): 1>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 1>, "Title"=>#<Var(Title): Pilot>, "ShortSummary"=>#<Var(ShortSummary): Walter White, a 50-year old che...>},
#     {"EpisodeNumber"=>#<Var(EpisodeNumber): 2>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 2>, "Title"=>#<Var(Title): Cat's in the Bag...>, "ShortSummary"=>#<Var(ShortSummary): Walt and Jesse try to dispose o...>},
#     ...and so on

Do you feel it now?

You can also take a look at the Showcase.
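If the last step of the chain above looks opaque, the idea behind it can be imitated in a few lines of plain Ruby: take a list of templates and, for each one, keep only the requested variables. This is only a conceptual sketch, not Infoboxer's implementation; `ToyTemplate`, `toy_fetch_hashes` and the `Notes` variable are hypothetical names, and the values here are plain strings rather than the `Var` objects shown in the output above.

```ruby
# Hypothetical stand-in for a parsed template: a name plus named variables.
ToyTemplate = Struct.new(:name, :variables)

# For each template, build a hash with only the requested variable names
# (names that a template does not define are skipped).
def toy_fetch_hashes(templates, *names)
  templates.map do |template|
    names.each_with_object({}) do |name, row|
      row[name] = template.variables[name] if template.variables.key?(name)
    end
  end
end

templates = [
  ToyTemplate.new('Episode list', { 'EpisodeNumber' => '1', 'Title' => 'Pilot', 'Notes' => 'ignored' }),
  ToyTemplate.new('Episode list', { 'EpisodeNumber' => '2', 'Title' => "Cat's in the Bag..." }),
  ToyTemplate.new('Reflist', {})
]

# Select templates by name pattern first, then pull out the variables.
episode_rows = toy_fetch_hashes(
  templates.select { |template| template.name =~ /^Episode list/ },
  'EpisodeNumber', 'Title'
)

p episode_rows
```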

Usage

Install gem

Install it as usual: add gem 'infoboxer' to your Gemfile, then run bundle install.

Or just [sudo] gem install infoboxer if you prefer.

Grab the page

# From English Wikipedia
page = Infoboxer.wikipedia.get('Argentina')
# or
page = Infoboxer.wp.get('Argentina')

# From other language Wikipedia:
page = Infoboxer.wikipedia('fr').get('Argentina')

# From any wiki with the same engine:
page = Infoboxer.wiki('http://companywiki.com').get('Our Product')

See more examples and options at Retrieving pages.

Play with page

Basically, a page is a tree of Nodes; you can think of it as a kind of DOM.

So, you can navigate it:

# Simple traversing and inspect
node = page.children.first.children.first
node.to_tree
node.to_text

# Various lookups
page.lookup(:Template, name: /^Infobox/)

See Tree navigation basics.
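For readers who have not worked with DOM-like trees, the following plain-Ruby sketch shows what a recursive lookup by node type and attribute pattern amounts to. It is a toy model under stated assumptions, not Infoboxer's code: `ToyNode` and its `params` hash are hypothetical, and the real `lookup` supports more selectors than shown here.

```ruby
# Hypothetical miniature node; real Infoboxer nodes are much richer.
class ToyNode
  attr_reader :type, :params, :children

  def initialize(type, params = {}, children = [])
    @type = type
    @params = params
    @children = children
  end

  # Depth-first search over all descendants, in document order.
  def lookup(type, **conditions)
    children.flat_map do |child|
      found = child.matches?(type, conditions) ? [child] : []
      found + child.lookup(type, **conditions)
    end
  end

  # A node matches when its type is equal and every condition accepts the
  # corresponding param (compared with ===, so regexps work).
  def matches?(type, conditions)
    self.type == type &&
      conditions.all? { |key, pattern| pattern === params[key] }
  end
end

page = ToyNode.new(:Page, {}, [
  ToyNode.new(:Template, { name: 'Infobox country' }),
  ToyNode.new(:Paragraph, {}, [
    ToyNode.new(:Template, { name: 'Convert' }),
    ToyNode.new(:Template, { name: 'Infobox settlement' })
  ])
])

infoboxes = page.lookup(:Template, name: /^Infobox/)
p infoboxes.map { |node| node.params[:name] }
```

Note that the nested template inside the paragraph is found as well, because the search descends through every level of the tree.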

On top of the basic navigation, Infoboxer adds some useful shortcuts for convenience and brevity, which allow things like this:

page.sections('Episodes').tables.first

See Navigation shortcuts.

To put it all in one piece, also take a look at Data extraction tips and tricks.

infoboxer executable

Just try the infoboxer command.

Without any options, it starts an IRB session with Infoboxer required and included into the main namespace.

With the -w option, it provides a shortcut to the MediaWiki instance you want, like this:

$ infoboxer -w https://en.wikipedia.org/w/api.php
> get('Argentina')
 => #<Page(title: "Argentina", url: "https://en.wikipedia.org/wiki/Argentina"): ....

You can also use shortcuts like infoboxer -w wikipedia for common wikis (and, just for fun, infoboxer -wikipedia works too).

Advanced topics

  • Reasons for Infoboxer creation;
  • Parsing quality (TL;DR: very good, but not ideal);
  • Performance (TL;DR: 0.1-0.4 sec to parse the largest pages);
  • Localization (TL;DR: for now, you'll need some work to use Infoboxer's most advanced features with non-English or non-Wikimedia wikis; basic and mid-level features always work);
  • If you plan to use Wikipedia or sister projects data in production, please consider the Wikipedia terms and conditions.

Compatibility

As of now, Infoboxer is reported to be compatible with any MRI Ruby since 2.0.0 (1.9.3 was supported previously and dropped in Infoboxer 0.2.0). In Travis-CI tests, JRuby fails due to a bug in old Java 7/Java 8 SSL certificate support (see here), and Rubinius fails 3 specs out of 500 for reasons that have not been investigated yet.

Therefore, those Ruby implementations are excluded from the Travis config, though they may still work for you.

Links

License

MIT.