Consider:
- the Egyptian hieroglyphics syntax
- 'Birth_date_and_age' vs 'Birth-date_and_age'
- the partial implementation of inline CSS
- deep recursion of similar-syntax templates
- the unexplained hashing scheme for image paths
- the custom encoding of whitespace and punctuation
- right-to-left values in left-to-right templates
- as of Nov 2018, there are 634,755 templates on Wikipedia
wtf_wikipedia supports many recursive shenanigans, deprecated and obscure template variants, and illicit 'wiki-esque' shorthands.
npm install wtf_wikipedia
var wtf = require('wtf_wikipedia');
wtf.fetch('Whistling').then(doc => {
doc.categories();
//['Oral communication', 'Vocal music', 'Vocal skills']
doc.sections('As communication').text();
// 'A traditional whistled language named Silbo Gomero..'
doc.images(0).thumb();
// 'https://upload.wikimedia.org..../300px-Duveneck_Whistling_Boy.jpg'
doc.sections('See Also').links().map(link => link.page)
//['Slide whistle', 'Hand flute', 'Bird vocalization'...]
});
on the client-side:
<script src="https://unpkg.com/wtf_wikipedia"></script>
<script>
//(follows redirect)
wtf.fetch('On a Friday', 'en', function(err, doc) {
var val = doc.infobox(0).get('current_members');
val.links().map(link => link.page);
//['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
});
</script>
- Detects and parses redirects and disambiguation pages
- Parses infoboxes into a formatted key-value object
- Handles recursive templates and links - like [[.. [[...]] ]]
- Per-sentence plaintext and link resolution
- Parses and formats internal links
- Creates image thumbnail URLs from File:XYZ.png filenames
- Properly resolves {{CURRENTMONTH}} and {{CONVERT ..}} type templates
- Parses images, headings, and categories
- Converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng (see the sketch below)
- Parses citation metadata
- Eliminates XML, LaTeX, CSS, and table-sorting cruft
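The coordinate conversion, for example, can be tried directly on a snippet of wikitext. This is a rough sketch only: the {{coord}} values are made up, and the shape of the objects returned by .coordinates() can vary between versions.
var wtf = require('wtf_wikipedia');
// a DMS-formatted {{coord}} template, like those found in settlement infoboxes
var doc = wtf('{{coord|59|12|7.7|N|2|32|35|W|display=title}}');
console.log(doc.coordinates());
// → an array with the parsed coordinate as decimal degrees,
//   roughly 59.202 / -2.543 (property names here are illustrative)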
Wikimedia's Parsoid JavaScript parser is the official wikiscript parser, and it's pretty cool. It reliably turns wikiscript into HTML, but not valid XML.
To use it for data-mining, you'll need to:
parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping
which is fine, but getting structured data this way (say, sentences or infobox values) is still a complex and weird process. Arguably, you're not any closer than you were with wikitext. This library has lovingly ❤️ borrowed a lot of code and data from the parsoid project, and thanks its contributors.
wtf_wikipedia was built to work with dumpster-dive, which lets you parse a whole Wikipedia dump on a laptop in a couple of hours. It's definitely the way to go, instead of fetching many pages off the API.
const wtf = require('wtf_wikipedia')
//parse a page
var doc = wtf(wikiText, [options])
//fetch & parse a page - wtf.fetch(title, [lang_or_wikiid], [options], [callback])
(async () => {
var doc = await wtf.fetch('Toronto');
console.log(doc.text())
})();
//(callback format works too)
wtf.fetch(64646, 'en', (err, doc) => {
console.log(doc.categories());
});
//get a random german page
wtf.random('de').then(doc => {
console.log(doc.text())
});
Document - the whole thing
  - Category
  - Coordinate
  Section - page headings ( ==these== )
    - Infobox - a main, key-value template
    - Table
    - Reference - citations, all forms
    - Template - any other structured data
    Paragraph - content separated by two newlines
      - Image
      - List - a series of bullet-points
      Sentence - contains links, formatting, dates
For the most part, these classes do the looping-around for you, so that Document.links() will go through every section, paragraph, and sentence to get their links.
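For example, asking for links at the document level versus a single section (a minimal sketch, re-using the 'Whistling' page from above):
var wtf = require('wtf_wikipedia');
wtf.fetch('Whistling').then(doc => {
  // walks every section, paragraph, and sentence for you
  var all = doc.links();
  // the same accessor, scoped to one part of the tree
  var some = doc.sections('See Also').links();
  console.log(all.length, some.length);
});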
Broadly speaking, you can ask for the data you'd like:
- .sections() - ==these things==
- .sentences()
- .paragraphs()
- .links()
- .tables()
- .lists()
- .images()
- .templates() - {{these|things}}
- .categories()
- .citations() - <ref>these guys</ref>
- .infoboxes()
- .coordinates()
or output things in various formats (a quick sketch follows this list):
- .json() - handy, workable data
- .text() - reader-focused plaintext
- .html()
- .markdown()
- .latex() - (ftw)
- .isRedirect() - boolean
- .isDisambiguation() - boolean
- .title() - guess the title of this page
- .redirectsTo() - {page:'China', anchor:'#History'}
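Tying a few of these together (the commented output is illustrative, not exact):
var wtf = require('wtf_wikipedia');
wtf.fetch('Toronto').then(doc => {
  doc.isRedirect();       // false
  doc.isDisambiguation(); // false
  doc.title();            // probably 'Toronto'
  doc.json();             // handy, workable data
  doc.markdown();         // the whole page as markdown
});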
Flip your wikimedia markup into a Document object:
import wtf from 'wtf_wikipedia'
wtf(`==In Popular Culture==
* harry potter's wand
* the simpsons fence`);
// Document {text(), html(), lists()...}
Retrieves the raw contents of a MediaWiki article from the Wikipedia Action API.
This method supports the errback callback form, or returns a Promise if no callback is given.
To call a non-English Wikipedia API, pass its language code as the second parameter:
wtf.fetch('Toronto', 'de', function(err, doc) {
doc.text();
//Toronto ist mit 2,6 Millionen Einwohnern..
});
You may also pass the Wikipedia page id as a parameter instead of the page title:
wtf.fetch(64646, 'de').then(console.log).catch(console.log)
The fetch method follows redirects.
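So fetching an old page name like 'On a Friday' (used in the client-side example above) should land on its target article. A small sketch:
wtf.fetch('On a Friday').then(doc => {
  console.log(doc.title());
  // the resolved article's title (likely 'Radiohead')
});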
The optional-callback pattern is the same for wtf.random():
wtf.random(lang, options, callback)
wtf.random(lang, options).then(doc=>doc.infobox())
Retrieves all pages and sub-categories belonging to a given category:
let result = await wtf.category('Category:Politicians_from_Paris');
//{
// pages: [{title: 'Paul Bacon', pageid: 1266127 }, ...],
// categories: [ {title: 'Category:Mayors of Paris' } ]
//}
//this format works too
wtf.category('National Basketball Association teams', 'en', (err, result)=>{
//
});
Returns only the nice plain text of the article:
var wiki =
"[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>";
var text = wtf(wiki).text();
//"Boston's baseball field has a 37ft wall."
wtf(page).sections(1).children()
wtf(page).sections('see also').remove()
s = wtf(page).sentences(4)
s.links()
s.bolds()
s.italics()
s.dates() //structured date templates
img = wtf(page).images(0)
img.url() // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format() // jpg, png, ..
img.exists() // HEAD req to see if the file is alive
If you're scripting this from the shell, or from another language, install with a -g, and then run:
$ wtf_wikipedia George Clooney --plaintext
# George Timothy Clooney (born May 6, 1961) is an American actor ...
$ wtf_wikipedia Toronto Blue Jays --json
# {text:[...], infobox:{}, categories:[...], images:[] }
The Wikipedia API is pretty welcoming, though it recommends three things if you're going to hit it heavily:
- pass an Api-User-Agent set to something they can use to easily throttle bad scripts
- bundle multiple pages into one request, as an array
- run it serially, or at least, slowly.
wtf.fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
'Api-User-Agent': 'spencermountain@gmail.com'
}).then((docList) => {
let allLinks = docList.map(doc => doc.links());
console.log(allLinks);
});
Join in! - projects like these are only done with many hands, and we try to be friendly and easy. PRs always welcome.
Some Big Wins:
- Supporting more templates - This is actually kinda fun.
- Adding more tests - you won't believe how helpful this is.
- Make a cool thing. Holler it at spencer.
If it's a big change, make an issue and talk it over first.
Otherwise, go nuts!
Thank you to the cross-fetch library.
MIT