Tutorial:Replacing a string for every page in a category

danmichaelo edited this page Nov 23, 2014 · 2 revisions

Suppose you have a category and wish to replace a given piece of text by a new, different piece of text in every page in that category. For example, suppose you were renaming the category "Greek presidants" to the correctly spelled "Greek presidents", and needed to replace [[Category:Greek presidants]] on every page in that category with [[Category:Greek presidents]]. The following code does this:

import mwclient
site = mwclient.Site(('https', 'en.wikipedia.org'))
site.login('username', 'password')
for page in site.Categories['Greek presidants']:
    print page.page_title
    text = page.text()
    text = text.replace('[[Category:Greek presidants]]', '[[Category:Greek presidents]]')
    page.save(text, summary='Renaming category Greek presidants to Greek presidents')
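
The loop above saves every page, including pages where nothing was replaced. A minimal sketch of one way to detect that case follows. The helper name `replace_link` is hypothetical and not part of mwclient; it works on plain strings only, so it can be tried without a wiki connection.

```python
OLD_LINK = '[[Category:Greek presidants]]'
NEW_LINK = '[[Category:Greek presidents]]'


def replace_link(text, old=OLD_LINK, new=NEW_LINK):
    """Return (new_text, changed) so the caller can skip unchanged pages."""
    new_text = text.replace(old, new)
    return new_text, new_text != text


sample = 'Some article text.\n[[Category:Greek presidants]]\n'
new_text, changed = replace_link(sample)
untouched, untouched_changed = replace_link('No category link here.')

# Inside the loop above, this could be used roughly as:
#     new_text, changed = replace_link(page.text())
#     if changed:
#         page.save(new_text, summary='...')
```

Whether skipping the save is worthwhile depends on the task; it mainly avoids sending requests that would not change the page.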

This basic text replacement suffices for many simple tasks. For more complex replacement tasks, Python regular expressions from the module "re" are useful. For example, suppose we also wanted to replace lowercase versions like [[Category:greek presidants]], or versions with the underscore [[Category:Greek_presidants]]. We add the line import re at the top and change the text.replace line to read:

    text = re.sub(r'(?i)\[\[Category:Greek[ _]presidants\]\]', '[[Category:Greek presidents]]', text)

The (?i) makes the match case-insensitive. Literal [ and ] characters must be escaped with \ inside a regular expression, because unescaped brackets delimit a character set. The [ _] is such a set, matching either a space or an underscore. Note that brackets are not escaped in the replacement text.
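
To check which spellings the pattern covers, it can be run against a few sample strings without touching the wiki. This is only an illustrative sketch; the sample strings and the names `PATTERN`, `REPLACEMENT`, `samples` and `results` are made up for the demonstration.

```python
import re

PATTERN = r'(?i)\[\[Category:Greek[ _]presidants\]\]'
REPLACEMENT = '[[Category:Greek presidents]]'

samples = [
    '[[Category:Greek presidants]]',   # original spelling
    '[[Category:greek presidants]]',   # lowercase first letter
    '[[Category:Greek_presidants]]',   # underscore instead of space
    '[[Category:Greek presidents]]',   # already correct, left unchanged
]

# Every sample should end up as the correctly spelled category link.
results = [re.sub(PATTERN, REPLACEMENT, sample) for sample in samples]
```

The character set matches exactly one space or underscore, so a link with two spaces between the words would not be replaced by this pattern.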

What if we want to replace text on every page in an entire category tree? For this we use a simple recursive function which calls itself on sub-categories, as in this example, which replaces all "cc-by-2.0" license tags with "cc0" in a given category tree on Commons:

import mwclient

def replace_in_category(category):
    print 'Replacing in category ' + category.page_title
    for page in category:
        if page.namespace == 14:  # 14 is the category namespace
            replace_in_category(page)
        else:
            print page.page_title
            text = page.text()
            text = text.replace('{{cc-by-2.0}}', '{{cc0}}')
            page.save(text, summary='Replacing license tag ({{cc-by-2.0}} -> {{cc0}})')
    print 'Done with category ' + category.page_title

site = mwclient.Site(('https', 'commons.wikimedia.org'))
site.login('username', 'password')
replace_in_category(site.Categories['Root category'])
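
Category trees on a wiki are not guaranteed to be free of loops, and a plain recursive function would never finish on a category that is (directly or indirectly) its own sub-category. The sketch below shows one possible guard: remembering the titles of categories already visited. It relies only on the `namespace` and `page_title` attributes and on iteration, as in the code above. `iter_tree_pages` and `FakePage` are hypothetical names, and `FakePage` is a stand-in used solely so the example runs without a wiki.

```python
CATEGORY_NAMESPACE = 14  # namespace number of category pages, as used above


def iter_tree_pages(category, seen=None):
    """Yield every non-category page below `category`, visiting each category once."""
    if seen is None:
        seen = set()
    if category.page_title in seen:
        return  # already visited: avoids infinite recursion on category loops
    seen.add(category.page_title)
    for page in category:
        if page.namespace == CATEGORY_NAMESPACE:
            for sub_page in iter_tree_pages(page, seen):
                yield sub_page
        else:
            yield page


class FakePage(object):
    """Hypothetical stand-in for an mwclient page, used only for this demo."""

    def __init__(self, page_title, namespace=0, members=()):
        self.page_title = page_title
        self.namespace = namespace
        self.members = list(members)

    def __iter__(self):
        return iter(self.members)


# Build a tiny tree in which the sub-category links back to the root.
root = FakePage('Root category', namespace=CATEGORY_NAMESPACE)
sub = FakePage('Sub category', namespace=CATEGORY_NAMESPACE)
root.members = [FakePage('A'), sub]
sub.members = [FakePage('B'), root]

titles = [page.page_title for page in iter_tree_pages(root)]
```

A page that belongs to several categories in the tree would still be yielded more than once; with a plain text replacement the second pass simply finds nothing left to replace.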

This page was originally imported from the old mwclient wiki at SourceForge. The imported version was dated from 00:59, 18 March 2012, and its only editor was Derrickcoetzee (@dcoetzee).