Tutorial:Querying Wikipedia with mwclient

Waldir Pimenta edited this page Jun 1, 2014 · 1 revision
Clone this wiki locally

mwclient is a library for accessing the MediaWiki API from Python. MediaWiki powers Wikipedia and a bunch of other wikis. In this quick guide, we'll look at how we can use mwclient to query any MediaWiki-powered site for the information we want. Installing mwclient

After installing mwclient (see the README), launch Python and confirm it's installed:

>>> import mwclient

If that didn't raise any errors, congratulations! You're all set to go.

Here's how you connect to Wikipedia and ask for revisions of the Wikipedia:Sandbox page:

>>> import mwclient
>>> from pprint import pprint
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Wikipedia:Sandbox']
>>> revisions = page.revisions()
>>> for counter in range(5):
...     rev = revisions.next()
...     pprint(rev)
... 
{'revid': 290932490,
 'timestamp': (2009, 5, 19, 12, 43, 13, 1, 139, -1),
 'user': 'Benlisquare'}
{'anon': '',
 'revid': 290930263,
 'timestamp': (2009, 5, 19, 12, 29, 23, 1, 139, -1),
 'user': '62.254.235.147'}
{'anon': '',
 'revid': 290930082,
 'timestamp': (2009, 5, 19, 12, 28, 16, 1, 139, -1),
 'user': '166.216.160.16'}
{'comment': 'Clearing the sandbox ([[WP:BOT|BOT]] EDIT)',
 'revid': 290927544,
 'timestamp': (2009, 5, 19, 12, 10, 6, 1, 139, -1),
 'user': 'SoxBot'}
{'anon': '',
 'revid': 290927187,
 'timestamp': (2009, 5, 19, 12, 7, 29, 1, 139, -1),
 'user': '62.254.235.147'}

Compare the output you get with the page’s revision history on Wikipedia. They should match.

Calling page.revisions() gives us a generator that returns revisions in reverse chronological order, with the most recent edit first. Each revision is a dictionary containing the keys you see above. The optional anon key indicates an anonymous edit; user then contains the editor's IP address instead of user name. All keys and string values will be Unicode strings.

To get all edits between two dates in the forward direction, with the text content of each revision, do this:

>>> revisions = page.revisions(start='2009-05-19T00:00:00Z',
...                            end='2009-05-19T23:59:59Z',
...                            dir='newer',
...                            prop='ids|timestamp|flags|comment|user|content')

And here's how to get all the edits of any given user. Let's look at SoxBot from the revisions above:

>>> contribs = site.usercontributions(u'SoxBot')
>>> for counter in range(2):
...     rev = contribs.next()
...     pprint(rev)
... 
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 17244650,
 'revid': 290942689,
 'timestamp': (2009, 5, 19, 13, 44, 26, 1, 139, -1),
 'title': 'User talk:Twinzor',
 'top': '',
 'user': 'SoxBot'}
{'comment': 'Delivering Vol. 5, Issue 20 of Wikipedia Signpost ([[User:SoxBot|BOT]])',
 'ns': 3,
 'pageid': 21352732,
 'revid': 290942678,
 'timestamp': (2009, 5, 19, 13, 44, 23, 1, 139, -1),
 'title': 'User talk:Turco85',
 'top': '',
 'user': 'SoxBot'}

Notes

  1. MediaWiki timestamp strings can be generated using "%Y-%m-%dT%H:%M:%SZ" as format string with Python's datetime.strftime. All timestamps must be in UTC.

  2. You can pass a combination of parameters to page.revisions() to get revisions the way you want them. You can even skip the dates and call with startid or endid = any revision number (see revid in the output), to retrieve revisions before or after that one.

  3. To look at what parameters the page.revisions() and site.usercontributions() functions take, use Python's built-in help browser:

>>> help(page.revisions)
>>> help(site.usercontributions)

This page was originally written by Kiran Jonnalagadda (@jace) on his blog, at 20:03, 19 May 2009.