Site.allpages

Waldir Pimenta edited this page Jun 1, 2014 · 1 revision

Site.allpages() returns an iterator over all pages in a given namespace. It can also filter pages by title prefix, article size, protection level, whether the page is a redirect, and whether language links are present, and can enumerate pages sorted by name. Typical usage is a for loop:

import mwclient

site = mwclient.Site('en.wikipedia.org')
for page in site.allpages():
    print(page.name)

Because it uses an iterator, allpages() begins to yield results quickly even on very large wikis, and retrieves more pages during iteration.

Corresponding API: Allpages

Parameters

  • (optional) start: Lists only pages whose names are >= start (or if dir = 'descending' is given, <= start). (default: None)
    • Note: results are not sorted unless generator = False. Even if the article with name "start" exists it may not appear first unless generator = False.
  • (optional) prefix: Lists only pages starting with the given prefix. (default: None)
  • (optional) namespace: Namespace index of the namespace to list pages from. Default is article namespace. (default: 0)
  • (optional) filterredir: 'redirects' to list only redirects, 'nonredirects' to omit redirects. (default: 'all')
  • (optional) minsize: Lists only pages with wikitext of at least this many bytes in size. (default: None)
  • (optional) maxsize: Lists only pages with wikitext of at most this many bytes in size. (default: None)
  • (optional) prtype: 'edit' lists only edit-protected pages, 'move' lists only move-protected pages. (default: None)
  • (optional) prlevel: Lists only pages protected at a given user level, default options are 'autoconfirmed', 'sysop'. (default: None)
    • If prlevel is given, prtype must also be given.
  • (optional) limit: Indicates the number of pages to retrieve at one time. Default is maximum permitted by the site. (default: None)
    • Note that this does not limit the number of results returned. To do that, use a counter and break out of the loop.
  • (optional) dir: alters interpretation of the start parameter. (default: 'ascending')
  • (optional) filterlanglinks: 'withlanglinks' lists only pages with language links. 'withoutlanglinks' lists only pages without language links. (default: 'all')
    • Note that unless filterredir = 'nonredirects' is given, a query with 'withoutlanglinks' is likely to also contain redirects.
  • (optional) generator: If True, the returned iterator yields a list of Page objects in arbitrary order. If False, it yields a list of page names (strings) in sorted order. (default: True)
    • The name "generator" is a reference to API generators, not to Python generators.

Result

If generator is True (the default), returns a GeneratorList that can be iterated over and yields a Page object for each page matching the given conditions (in arbitrary order). If generator is False, returns a List that can be iterated over and yields the name of each matching page as a string (in sorted order).

Errors

Because allpages() builds the query and returns immediately without contacting the server, the call itself does not normally produce errors.

For errors that may be produced when iterating over the resulting iterator, see GeneratorList.next and List.next.

Examples

This example iterates over one of two allpages() lists depending on a conditional. The allpages() calls return immediately, and results are only loaded once the loop begins:

if apples_only:
    pages = site.allpages(prefix='Apple')
else:
    pages = site.allpages()
for page in pages:
    print(page.name)

This fragment lists all articles starting with 'Apple' in sorted order by name. It uses generator = False to ensure results are in sorted order, but this also means we must use page rather than page.name to refer to the page name:

for page in site.allpages(prefix='Apple', generator=False):
    print(page)

This fragment lists the first 100 articles starting at M with wikitext of length between 50 and 200 bytes (note that limit = 100 would not work here):

count = 0
for page in site.allpages(start='M', minsize=50, maxsize=200):
    print(page.name)
    count += 1
    if count == 100:
        break
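The counter-and-break pattern can also be written with itertools.islice, which caps any iterator at a fixed number of items. A minimal sketch; the generator below is a stand-in for the live site.allpages(start='M', minsize=50, maxsize=200) iterator, which requires a wiki connection:

```python
from itertools import islice

# islice(iterable, n) yields at most n items and then stops, which
# replaces the manual counter. Substitute the real allpages() call
# for the stand-in generator when working against a live wiki.
stand_in = ('Page %d' % i for i in range(1000))
for name in islice(stand_in, 100):
    print(name)
```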

This fragment lists all blank user pages:

for page in site.allpages(namespace=2, maxsize=0):
    print(page.name)

This fragment lists the names of all articles starting with "Apple" which are not redirects and don't have language links:

for page in site.allpages(prefix='Apple', filterredir='nonredirects', filterlanglinks='withoutlanglinks'):
    print(page.name)

This fragment lists all semi-protected articles, where semi-protected means only autoconfirmed users are allowed to edit them:

for page in site.allpages(prtype='edit', prlevel='autoconfirmed'):
    print(page.name)

This fragment lists the last article in sorted order starting with T:

pages = site.allpages(prefix='T', dir='descending', limit=1)
print(next(iter(pages)).name)

Notes

On very large wikis with millions of pages, such as the English Wikipedia, iterating over all pages with no restricting parameters would take hours and consume extensive server resources, and so is inadvisable without prior permission.

If having up-to-date information is not essential, a more efficient alternative is to create an XML dump periodically and process it locally. Wikimedia Foundation projects like Wikipedia supply database dumps periodically at http://dumps.wikimedia.org/. Another alternative that is more efficient in many cases is a direct SQL query against the MediaWiki database (or a replica); for Wikimedia Foundation projects such a service is provided by the Toolserver.

Although mwclient only guarantees sorted order when generator = False is passed, out-of-order results can only occur within a single query batch. So, if limit is 100, the second 100 results all follow the first 100 results in sorted order. This means that if limit is 1, results arrive in sorted order even when generator = True, but this is very inefficient, performing one API query per page.
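When the matching set is small enough to hold in memory, a cheaper way to combine Page objects with sorted order is to materialize the results and sort them client-side by name. A minimal sketch, using stub objects with a name attribute in place of live Page objects from allpages():

```python
from operator import attrgetter

# Stub objects standing in for the Page objects that
# site.allpages(prefix='Apple') would yield on a live wiki.
class StubPage:
    def __init__(self, name):
        self.name = name

pages = [StubPage(n) for n in ['Apricot', 'Apple pie', 'Apple']]

# Materialize the iterator and sort client-side by page name.
for page in sorted(pages, key=attrgetter('name')):
    print(page.name)
```

This issues only the normal batched queries, at the cost of holding every result in memory before the first one is processed.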


This page was originally imported from the old mwclient wiki at SourceForge. The imported version was dated from 00:11, 18 March 2012, and its only editor was Derrickcoetzee (@dcoetzee).