zverok edited this page Aug 7, 2015 · 5 revisions

Getting Everything Together

The overall process of information extraction from Wikipedia pages can look like this:

1. Get familiar with Wikipedia markup, at least to the level described on this page. The most information-rich parts are typically links, tables and templates.

2. Study the page's source and structure, not only its rendered appearance.

For example, consider extracting the list of episodes of some great TV show. Just from looking at the rendered page, you'd guess it is all about tables (because you see a table of episodes). But inspecting the source, you'll find that each episode is described by a template:

{{Episode list/sublist|List of Breaking Bad episodes
 |EpisodeNumber = 1
 |EpisodeNumber2 = 1
 |Title = [[Pilot (Breaking Bad)|Pilot]]
...
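To get a feel for this shape before reaching for Infoboxer, here is a rough illustration in plain Ruby (not Infoboxer): a template call in wiki markup looks like `{{Name|param = value|...}}`, and a naive regex can pull the template name out of raw source like the snippet above. (This regex is far too naive for nested templates; Infoboxer's real parser handles those properly.)

```ruby
# Raw wiki source, shaped like the episode-list snippet above.
markup = "{{Episode list/sublist|List of Breaking Bad episodes\n" \
         " |EpisodeNumber = 1\n |Title = [[Pilot (Breaking Bad)|Pilot]]\n}}"

# The template name is everything after "{{" up to the first "|" or "}".
name = markup[/\{\{([^|}]+)/, 1]
# => "Episode list/sublist"
```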

You can also inspect the structure with Infoboxer's help:

puts Infoboxer.wp.get('Breaking Bad (season 1)').
  sections('Episodes').to_tree
# Output:
# <Section>
# .....
#    <Template(Episode list/sublist)>
#      <TemplateVariable(1)>
#        List of Breaking Bad episodes <Text>
#      <TemplateVariable(EpisodeNumber)>
#        1 <Text>
#      <TemplateVariable(EpisodeNumber2)>
#        1 <Text>
#      <TemplateVariable(Title)>
#        Pilot <Wikilink(link: "Pilot (Breaking Bad)")>
# ......

After this introspection, the extraction algorithm is clear:

Infoboxer.wp.get('Breaking Bad (season 1)').
  sections('Episodes').templates(name: 'Episode table').
  fetch('episodes').templates(name: /^Episode list/).
  fetch_hashes('EpisodeNumber', 'EpisodeNumber2', 'Title', 'ShortSummary')
# => [{"EpisodeNumber"=>#<Var(EpisodeNumber): 1>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 1>, "Title"=>#<Var(Title): Pilot>, "ShortSummary"=>#<Var(ShortSummary): Walter White, a 50-year old che...>},
#     {"EpisodeNumber"=>#<Var(EpisodeNumber): 2>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 2>, "Title"=>#<Var(Title): Cat's in the Bag...>, "ShortSummary"=>#<Var(ShortSummary): Walt and Jesse try to dispose o...>},
#     ...and so on
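fetch_hashes returns hashes of template-variable nodes, not plain strings; a common last step is to flatten them to text. Here is a sketch with a hypothetical `Var` struct standing in for Infoboxer's variable nodes (the assumption being that the real nodes likewise respond to `text`):

```ruby
# Hypothetical stand-in for Infoboxer's template variable nodes; the real
# nodes also expose their plain-text value.
Var = Struct.new(:text)

rows = [
  { 'EpisodeNumber' => Var.new('1'), 'Title' => Var.new('Pilot') },
  { 'EpisodeNumber' => Var.new('2'), 'Title' => Var.new("Cat's in the Bag...") }
]

# Flatten each row from name => node into name => plain string.
plain = rows.map { |row| row.map { |key, var| [key, var.text] }.to_h }
# => [{"EpisodeNumber"=>"1", "Title"=>"Pilot"},
#     {"EpisodeNumber"=>"2", "Title"=>"Cat's in the Bag..."}]
```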

3. Typically, pages of the same type (episode lists of TV shows, articles about countries, articles about animal species, ...) are very similar. Still, to make a reliable extraction algorithm, you'll need to examine several pages of a kind, because some differences are always there. After 3-5 similar articles, your algorithm will be good enough to extract data from most articles of that kind. Yet there will always be some outliers and complex cases. Life is complicated. Wikipedia is complicated.
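One defensive pattern when pages vary is to try several selectors in turn and take the first that matches anything. Sketched here with plain arrays standing in for Infoboxer node lists (`first_nonempty` is a hypothetical helper, not an Infoboxer method):

```ruby
# Hypothetical helper: return the first candidate list that matched anything.
def first_nonempty(*candidates)
  candidates.find { |c| !c.empty? } || []
end

# With real pages, each candidate would be a different lookup, e.g.
# templates under the 'Episodes' section vs. templates anywhere on the page.
first_nonempty([], ['row1', 'row2'])
# => ["row1", "row2"]
```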

Useful tips

  • There are several "common ancestor" node classes that are never instantiated but are useful for generic lookup:
    • lookup(:BaseParagraph) -- all "paragraph-level" nodes (paragraphs, headings, lists, ...);
    • lookup(:List) -- all ordered, unordered and definition lists;
    • lookup(:BaseCell) -- table plain cells and heading cells.